A Comparison of Three Empirical Reliability Estimates for Computerized Adaptive Testing
Seo, Dong Gi. Niigata, Japan: Niigata Seiryo University, 08/2017.

Abstract:
Reliability estimates in computerized adaptive testing (CAT) are derived from estimated thetas and the standard errors of those estimates. In practice, the observed standard error (OSE) of an estimated theta can be obtained from the test information function for each examinee under item response theory (IRT). Unlike in classical test theory (CTT), OSEs in IRT are conditional on each estimated theta, so they must be marginalized to yield a test-level reliability. The arithmetic mean, the harmonic mean, and Jensen's equality were applied to marginalize the OSEs and estimate CAT reliability. The three resulting empirical CAT reliabilities, each based on a different marginalization method, were compared with true reliability. Results showed that all three empirical CAT reliabilities underestimated true reliability at short test lengths (< 40 items), whereas at longer test lengths (> 40 items) their magnitudes were ordered by marginalization method: Jensen's equality, harmonic mean, then arithmetic mean. In particular, Jensen's equality overestimated true reliability across all conditions at test lengths above 50 items.
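The abstract does not spell out the estimators, but a common empirical reliability in IRT is rel = var(theta_hat) / (var(theta_hat) + E[SE^2]), where E[SE^2] is the marginalized error variance. The Python sketch below illustrates, under that assumption, how the arithmetic and harmonic means marginalize the squared OSEs differently; the function name and toy data are hypothetical, and the paper's exact "Jensen equality" estimator is not reproduced here.

    import numpy as np

    def empirical_reliability(theta_hat, se, marginalize="arithmetic"):
        """Empirical CAT reliability from estimated thetas and conditional SEs.

        A sketch assuming rel = var(theta_hat) / (var(theta_hat) + E[SE^2]).
        Each squared OSE equals 1/I(theta_hat), the reciprocal test information.
        """
        theta_hat = np.asarray(theta_hat, dtype=float)
        err_var = np.asarray(se, dtype=float) ** 2   # conditional error variances

        if marginalize == "arithmetic":
            marginal_ev = err_var.mean()                         # E[1/I]
        elif marginalize == "harmonic":
            marginal_ev = len(err_var) / np.sum(1.0 / err_var)   # 1/E[I]
        else:
            raise ValueError(marginalize)

        obs_var = theta_hat.var(ddof=1)
        return obs_var / (obs_var + marginal_ev)

    # By Jensen's inequality, 1/E[I] <= E[1/I]: the harmonic-mean error
    # variance never exceeds the arithmetic one, so the harmonic-mean
    # reliability is never smaller, consistent with the ordering reported
    # in the abstract.
    rng = np.random.default_rng(1)
    theta = rng.normal(size=500)                 # toy ability estimates
    se = rng.uniform(0.25, 0.45, size=500)       # toy conditional SEs
    print(empirical_reliability(theta, se, "arithmetic"))
    print(empirical_reliability(theta, se, "harmonic"))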
Keywords: CAT; Reliability
Session Video: https://drive.google.com/file/d/1gXgH-epPIWJiE0LxMHGiCAxZZAwy4dAH/view?usp=sharing


From Reliability to Validity: Expanding Adaptive Testing Practice to Find the Most Valid Score for Each Test Taker
Wise, Steven L. 10/2011.

Abstract:
CAT is an exception to the traditional conception of validity: it is one of the few examples of individualized testing, with item difficulty tailored to each examinee. The intent, however, is increased efficiency. The focus is on reliability (reduced standard error), equivalence with paper-and-pencil tests is valued, and validity is enhanced through improved reliability.
How Else Might We Individualize Testing Using CAT?
- By addressing construct-irrelevant factors influencing individual test scores (usually in negatively biased ways).
- Individual Score Validity (ISV): how free a particular score is from construct-irrelevant factors (often called construct-irrelevant variance, or CIV).

An ISV-Based View of Validity
- Test event: an examinee encounters a series of items in a particular context.
- All 3 elements (examinee, items, context) are potential sources of CIV. Examples: test anxiety (examinee), amount/difficulty of reading required (items), test stakes (context).
- ISV can be affected by all 3 elements.
- CAT goal: individualize testing to address CIV threats to score validity (i.e., maximize ISV).

Some Research Issues:
- What are some innovative methods for expanding CAT that address ISV threats while preserving measurement of the target construct?
- How might CAT help address the ISV challenges posed by test anxiety?
- How should policy-makers deal with scores that have been shown to have low ISV?
Keywords: CAT; CIV; construct-irrelevant variance; Individual Score Validity; ISV; low test taking motivation; Reliability; validity
URL: http://www.iacat.org/content/reliability-validity-expanding-adaptive-testing-practice-find-most-valid-score-each-test


An examination of the reliability and validity of performance ratings made using computerized adaptive rating scales
Buck, D. E. 2000; vol. 61, p. 57.

Abstract:
This study compared the psychometric properties of performance ratings made using recently developed computerized adaptive rating scales (CARS) to the psychometric properties of ratings made using more traditional paper-and-pencil rating formats, i.e., behaviorally anchored and graphic rating scales. Specifically, the reliability, validity, and accuracy of the performance ratings from each format were examined. One hundred twelve participants viewed six 5-minute videotapes of office situations and rated the performance of a target person in each videotape on three contextual performance dimensions (Personal Support, Organizational Support, and Conscientious Initiative) using CARS and either behaviorally anchored or graphic rating scales. Performance rating properties were measured using Shrout and Fleiss's intraclass correlation (2,1), Borman's differential accuracy measure, and Cronbach's accuracy components as indexes of rating reliability, validity, and accuracy, respectively. Results found that performance ratings made using the CARS were significantly more reliable and valid than performance ratings made using either of the other formats. Additionally, CARS yielded more accurate performance ratings than the paper-and-pencil formats. The nature of the CARS system (i.e., its adaptive nature and scaling methodology) and its paired-comparison judgment task are offered as possible reasons for the differences found in the psychometric properties of the performance ratings made using the various rating formats. (PsycINFO Database Record (c) 2005 APA)

Keywords: Adaptive Testing; Computer Assisted Testing; Performance Tests; Rating Scales; Reliability; Test; Test Validity
URL: http://www.iacat.org/content/examination-reliability-and-validity-performance-ratings-made-using-computerized-adaptive
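For reference, the ICC(2,1) index cited in the last abstract has a standard closed form (Shrout & Fleiss, 1979: two-way random effects, absolute agreement, single rater). The Python sketch below is a minimal illustration with made-up ratings, not the study's data or code.

    import numpy as np

    def icc_2_1(ratings):
        """Shrout & Fleiss (1979) ICC(2,1) for an (n targets x k raters) matrix."""
        x = np.asarray(ratings, dtype=float)
        n, k = x.shape
        grand = x.mean()

        # Mean squares from the two-way ANOVA decomposition.
        ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)    # targets
        ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)    # raters
        ss_err = (np.sum((x - grand) ** 2)
                  - (n - 1) * ms_rows - (k - 1) * ms_cols)
        ms_err = ss_err / ((n - 1) * (k - 1))

        return (ms_rows - ms_err) / (
            ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

    # Toy example: 6 targets rated by 3 raters (hypothetical data).
    ratings = np.array([
        [4, 5, 4],
        [2, 3, 2],
        [5, 5, 4],
        [1, 2, 2],
        [3, 3, 3],
        [4, 4, 5],
    ])
    print(round(icc_2_1(ratings), 3))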