June 2 and 3, 2009

Note: Abstracts and contact information for unavailable papers follow the conference schedule

Tuesday, 2 June

8:30 – 9:00 Welcomes: Larry Rudner, GMAC, and Dave Weiss, University of Minnesota

9:00 – 10:15 Realities of CAT: Dave Weiss, University of Minnesota, Chair

Effect of Early Misfit in Computerized Adaptive Testing on the Recovery of Theta. Rick Guyer and David J. Weiss, University of Minnesota

Quantifying the Impact of Compromised Items in CAT. Fanmin Guo, Graduate Management Admission Council

Guess What? Score Differences With Rapid Replies Versus Omissions on a Computerized Adaptive Test. Eileen Talento-Miller and Fanmin Guo, Graduate Management Admission Council

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss, University of Minnesota


10:30 – 12:00 CAT for Classification: Dave Weiss, University of Minnesota, Chair

Computerized Classification Testing in More Than Two Categories by Using Stochastic Curtailment. Theo J.H.M. Eggen, CITO and University of Twente, The Netherlands and Jasper T. Wouda, CITO, Arnhem, The Netherlands

Utilizing the Generalized Likelihood Ratio as a Termination Criterion. Nathan A. Thompson, Assessment Systems Corporation

Adaptive Testing Using Decision Theory. Lawrence M. Rudner, Graduate Management Admission Council

“Black Box” Adaptive Testing by Mutual Information and Multiple Imputations. Anne Thissen-Roe, Kronos

A Comparison of Computerized Adaptive Testing Approaches: Real-data Simulations of IRT- and Non-IRT-based CAT with Personality Measures. Monica M. Rudick, Wern How Yam, and Leonard Simms, University at Buffalo

12:30 – 2:00 Posters: CAT Research and Applications Around the World (Concurrent)

A Comparison of Three Methods of Item Selection for Computerized Adaptive Testing. Denise Reis Costa and Camila Akemi Karino, CESPE/University of Brasilia; Fernando A. S. Moura, Federal University of Rio de Janeiro; and Dalton F. Andrade, Federal University of Santa Catarina

Adequacy of an Item Pool for Proficiency in English Language From the University of Brasilia for Implementation of a CAT Procedure. Camila Akemi Karino, Denise Reis Costa, and Jacob Arie Laros, CESPE/University of Brasilia

Development of an Item Model Taxonomy for Automatic Item Generation in Computerized Adaptive Testing. Hollis Lai, Mark J. Gierl, and Cecilia Alves, University of Alberta, Canada

An Approach to Implementing Adaptive Testing Using Item Response Theory in a Paper-Pencil Mode. V. Natarajan, MeritTrac Services Pvt. Ltd, India

Assessing the Equivalence of Internet-Based vs. Paper-and-Pencil Psychometric Tests. Naomi Gafni, Keren Roded, and Michal Baumer, National Institute for Testing and Evaluation, Israel

Features of a CAT System and Its Application to J-CAT. Shingo Imai, Y. Akagi, Yamaguchi University, K. Kikuchi, Toho University, S. Ito, TUFS, Y. Nakamura, Tokiwa University, H. Nakasono, Shimane University, A. Honda, APU, and T. Hiramura, TIT, Japan

Adaptive Measurement of Cognitive Ability Based on a Person’s Zone of Nearest Development. Marina Chelyshkova and Victor Zvonnikov, State University of Management, Russia

Implementing Figural Matrix Items in a Computerized Adaptive Testing System: Singapore’s Experience. Poh Hua Tay and Raymond Fong, Ministry of Education, Singapore

Constrained Item Selection Using a Stochastically Curtailed SPRT. Jasper T. Wouda and Theo J. H. M. Eggen, The Netherlands

Using Enhanced Effective Response Time to Detect the Extent and Track the Trend of Item Pre-Knowledge on a Large-Scale Computerized Adaptive Assessment. Jie Li and Xiang Bo Wang, ACT, Inc., U.S.A.

Computerized Adaptive Testing for the Singapore Employability Skills System (ESS). Patricia Rickard, CASAS, James B. Olsen, Alpine Testing Solutions, Debalina Ganguli, CASAS, and Richard Ackermann, Team Code, Inc., U.S.A.

Criterion-Related Validity of an Innovative CAT-Based Personality Measure. Robert J. Schneider, Richard A. McLellan, and Tracy M. Kantrowitz, PreVisor, Inc.; Janis S. Houston and Walter C. Borman, PDRI, U.S.A.

1:00 – 1:40 CAT in Spain and Israel (Concurrent With Poster Session): Dave Weiss, University of Minnesota, Chair

Computerized Adaptive Testing in Spain: Description, Item Parameter Updating and Future Trends of eCAT. Francisco J. Abad, David Aguado, Julio Olea, and Vicente Ponsoda, Universidad Autónoma de Madrid, and Juan Ramón Barrada, Universidad Autónoma de Barcelona, Spain

Twenty-Two Years of Applying CAT for Admission to Higher Education in Israel. Naomi Gafni and Yoav Cohen, National Institute for Testing and Evaluation, Jerusalem, Israel

2:00 – 3:15 Concurrent Sessions

Item Selection: Larry Rudner, GMAC, Chair

Item Selection and Hypothesis Testing for the Adaptive Measurement of Change. Matthew Finkelman, Tufts University School of Dental Medicine, David J. Weiss, University of Minnesota, and Gyenam Kim-Kang, Korea Nazarene University

A Gradual Maximum Information Ratio Approach to Item Selection in Computerized Adaptive Testing. Kyung (Chris) T. Han, Graduate Management Admission Council

Item Selection With Biased-Coin Up-and-Down Designs. Yanyan Sheng, Southern Illinois University at Carbondale

A Burdened CAT: Incorporating Response Burden with Maximum Fisher’s Information for Item Selection. Richard J. Swartz, The University of Texas M. D. Anderson Cancer Center, and Seung W. Choi, NorthShore University HealthSystem Research Institute and Northwestern University

Real-Time Analysis: Fanmin Guo, GMAC, Chair

Adaptive Item Calibration: A Simple Process for Estimating Item Parameters Within a Computerized Adaptive Test. G. Gage Kingsbury, Northwest Evaluation Association

On-the-Fly Item Calibration in Low-Stakes CAT Procedures. Sharon Klinkenberg and Marthe Straatemeier, Department of Psychology, University of Amsterdam, Gunter Maris, CITO, and Han van der Maas, Department of Psychology, University of Amsterdam

An Automatic Online Calibration Design in Adaptive Testing. Guido Makransky, University of Twente / Master Management International A/S, and Cees A. W. Glas, University of Twente

Investigating Cheating Effects on the Conditional Sympson and Hetter Online Procedure with Freeze Control for Testlet-Based Items. Ya-Hui Su, University of California, Berkeley

Government-Supported CAT Programs and Projects: Dave Weiss, University of Minnesota, Chair

Department of Defense

The Nine Lives of CAT-ASVAB: Innovations and Revelations. Mary Pommerich, Daniel O. Segall, and Kathleen E. Moreno, Defense Manpower Data Center

National Institutes of Health

Development of a Comprehensive CAT-Based Instrument for Measuring Depression. Robert D. Gibbons, University of Illinois at Chicago

Development of a CAT to Measure Dimensions of Personality Disorder: The CAT-PD Project. Leonard J. Simms, University at Buffalo

The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System

IRT Software: David Thissen, The University of North Carolina at Chapel Hill and Scientific Software International

CAT Software: Nathan Thompson, Assessment Systems Corporation

Wednesday, 3 June

8:15 – 9:25 Concurrent Sessions

Item Exposure: Larry Rudner, GMAC, Chair

Reviewing Test Overlap Rate and Item Exposure Rate as Indicators of Test Security in CATs. Juan Ramón Barrada, Universidad Autónoma de Barcelona; Julio Olea, Vicente Ponsoda, and Francisco J. Abad, Universidad Autónoma de Madrid.

Optimizing Item Exposure Control and Test Termination Algorithm Pairings for Polytomous Computerized Adaptive Tests With Restricted Item Banks. Michael Chajewski and Charles Lewis, Fordham University

Limiting Item Exposure for Key-Difficulty Ranges in a High-Stakes CAT. Xin Li, Kirk A. Becker, and Jerry L. Gorham, Pearson VUE; and Ada Woo, National Council of State Boards of Nursing

Multidimensional CAT: Nate Thompson, Assessment Systems Corporation, Chair

Comparison of Adaptive Bayesian Estimation and Weighted Bayesian Estimation in Multidimensional Computerized Adaptive Testing. Po-Hsi Chen, National Taiwan Normal University

Comparison of Ability Estimation and Item Selection Methods in Multidimensional Computerized Adaptive Testing. Qi Diao, Michigan State University and CTB/McGraw-Hill, and Mark Reckase, Michigan State University

Multidimensional Adaptive Testing: The Application of Kullback-Leibler Information. Chun Wang and Hua-Hua Chang, University of Illinois at Urbana-Champaign

Multidimensional Adaptive Personality Assessment: A Real-Data Confirmation. Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology

9:35 – 10:45 Item and Pool Development: Larry Rudner, GMAC, Chair

A Comparison of Three Procedures for Computing Information Functions for Bayesian Scores From Computerized Adaptive Tests. Kyoko Ito, Human Resources Research Organization, and Mary Pommerich and Daniel O. Segall, Defense Manpower Data Center

Adaptive Computer-Based Tasks Under an Assessment Engineering Paradigm. Richard M. Luecht, The University of North Carolina at Greensboro

Developing Item Variants: An Empirical Study. Anne Wendt, National Council of State Boards of Nursing, Shu-chuan Kao, Pearson VUE, Jerry Gorham, Pearson VUE, and Ada Woo, National Council of State Boards of Nursing

Evaluation of a Hybrid Simulation Procedure for the Development of Computerized Adaptive Tests. Steven W. Nydick and David J. Weiss, University of Minnesota

11:00 – 11:55 Diagnostic Testing: Larry Rudner, GMAC, Chair

Computerized Adaptive Testing for Cognitive Diagnosis. Ying Cheng, University of Notre Dame

Obtaining Reliable Diagnostic Information through Constrained CAT. Hua-Hua Chang, Jeff Douglas and Chun Wang, University of Illinois

Applying the DINA Model to GMAT Focus Data. Alan Huebner, Xiang Bo Wang, and Sung Lee, ACT, Inc.

11:55 – 12:30 Wrap-Up and Future Directions: Larry Rudner and Dave Weiss

Abstracts

A Comparison of Computerized Adaptive Testing Approaches: Real-data Simulations of IRT- and Non-IRT-based CAT with Personality Measures. Monica M. Rudick, Wern How Yam, and Leonard Simms, University at Buffalo, State University of New York

A variety of approaches have been implemented to create CAT personality assessments. Recent research has focused on IRT-based CAT for personality measures, although IRT is computationally complex and requires assumptions that do not always hold for personality data. As a result, non-IRT CAT approaches, such as the countdown method, have also been applied successfully to CAT versions of personality measures. Within the countdown method, there is some debate regarding whether the classification or the full-scores-on-elevated-scales (FSES) variant is preferable. In addition, it is unclear how the order of item administration might affect item savings and the validity of scores. Both IRT-based and non-IRT methods appear to offer numerous advantages for CAT assessments, most notably time and item savings and ease of administration; however, the two methods have yet to be directly compared. The purpose of the present study was to compare non-IRT and IRT-based approaches using real-data CAT simulations in a large, diverse sample (N = 8,690) that completed the Schedule for Nonadaptive and Adaptive Personality (SNAP). The report focuses on the three longest SNAP scales: Disinhibition (DIS), Negative Temperament (NT), and Positive Temperament (PT). Simulation analyses compared item savings, item and test information, test validity, and fidelity across the IRT-based and non-IRT CAT methods. In addition, the countdown simulations examined whether item presentation order affected the results. Results will have implications for test developers wishing to apply CAT technology to personality measures. For further information: mmrudick@buffalo.edu
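
For readers unfamiliar with the countdown logic, a minimal sketch of its classification variant appears below. This is an illustration of the general technique, not the study's code; the function name and cutoff handling are ours. The FSES variant would instead keep administering items on scales classified as elevated until full scale scores are obtained.

    def countdown_classify(responses, cutoff):
        """Classification variant of the countdown method (sketch).

        responses: 0/1 item endorsements in administration order.
        cutoff: raw score at or above which the scale is 'elevated'.
        Returns (classification, number_of_items_used).
        """
        n_items = len(responses)
        score = 0
        for i, r in enumerate(responses, start=1):
            score += r
            remaining = n_items - i
            if score >= cutoff:              # cutoff already reached
                return "elevated", i
            if score + remaining < cutoff:   # cutoff can no longer be reached
                return "not elevated", i
        return "not elevated", n_items

Item savings in such a simulation are simply the difference between the full scale length and the number of items used before one of the two stopping conditions fires.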

Adaptive Measurement of Cognitive Ability Based on a Person’s Zone of Nearest Development. Marina Chelyshkova and Victor Zvonnikov, State University of Management, Russia

At present, most schools and universities in Russia attach great importance to cognitive processes in education. We think that in modern testing it is important not only to estimate the knowledge a person has but also to evaluate cognitive ability, which is more complex than knowledge and skills. Measuring cognitive ability usually requires specially constructed item content, which cognitive learning theories provide. But there are other aspects of such measurement: they concern optimizing item difficulty and require the application of adaptive testing. Weiss analyzed person characteristic curves and suggested methods for adapting item difficulty to the individual. We combined these ideas with the concepts of the Russian scientist L. S. Vygotsky, who distinguished between what a person can already do (the actual zone) and the person’s capacity for further development. His concept allows the score reflecting actual knowledge to be connected with the width of a person’s zone of nearest development. We propose a method for evaluating this connection using one- and two-parameter IRT models, expressed as a system of inequalities relating the person parameter to the item parameters. To measure cognitive ability, we suggest choosing items whose difficulty is appropriate to the person’s zone of nearest development, instead of following traditional scoring approaches in adaptive testing. We related the width of the zone of nearest development to test items in terms of the difficulty and slope of their item characteristic curves. This allows us to evaluate a person’s cognitive ability and to predict changes in achievement depending on the time factor and the steepness of the person characteristic curve. In this way, we can optimize item difficulty in adaptive testing for the measurement of cognitive ability. For further information: mchelyshkova@mail.ru
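
The sketch below shows one possible reading of this selection rule, assuming a Rasch model and treating the zone of nearest development as a band of success probabilities just below mastery. The probability bounds and function names are illustrative assumptions on our part, not the authors' actual system of inequalities.

    import math

    def p_correct(theta, b):
        """Rasch probability of solving an item of difficulty b unaided."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def in_zpd(theta, b, p_lo=0.2, p_hi=0.5):
        """One reading of the ZPD as inequalities on the item parameters:
        the person cannot yet solve the item reliably alone (p < p_hi),
        but it is within reach (p > p_lo). Bounds are illustrative."""
        return p_lo < p_correct(theta, b) < p_hi

    def zpd_items(theta, bank):
        """Difficulties in the bank falling inside the person's ZPD."""
        return [b for b in bank if in_zpd(theta, b)]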

Implementing Figural Matrix Items in a Computerized Adaptive Testing System: Singapore’s Experience. Poh Hua Tay and Raymond Fong, Ministry of Education, Singapore

Figural matrix items such as Raven’s Standard Progressive Matrices (SPM) are widely used for assessing the general intelligence of pupils. Substantial manpower resources are incurred when tests are administered on a large scale via paper-and-pencil (P&P). A computer-based test (CBT) offers logistical ease during data collection and administrative ease during data entry; this is especially so for CAT, which reduces administration time as well. Unlike P&P and CBT, a CAT adaptively selects the most appropriate items for each pupil based on his/her responses to previous items. This permits each pupil to be evaluated on a smaller subset of the total item pool, provides a better test experience because items are chosen to match his/her ability, and allows the test developer to control the error of measurement to a desired degree of precision. In this study, an item bank of 195 figural matrix items similar to those of the SPM was created. The psychometric properties of these items were established by trialing them on a sample of 6,821 Primary 2 pupils (equivalent to Grade 2 pupils, about 8 years of age) of varying academic abilities from 20 coeducational schools in Singapore. IRT was used to calibrate all the figural matrix items. From this item bank, a P&P prototype, two CAT prototypes (one starting with an easy item, the other with an average item), and a CBT prototype were generated and administered, via the FastTEST Pro v2.3 platform, to four groups of Primary 2 pupils in Singapore. These groups comprised a total of 948 Primary 2 pupils of varying academic abilities from 12 coeducational schools. The SPM was also administered to all of them via P&P. The project was designed to study the comparability of pupil abilities estimated from the different prototypes (P&P, CATs, CBT) and the SPM. For further information: tay_poh_hua@moe.gov.sg
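
A minimal sketch of the adaptive selection step described above, assuming a Rasch-type calibration with difficulties on the same scale as ability; the function is illustrative, not the FastTEST Pro implementation. The “easy start” and “average start” prototypes correspond to different initial values of the provisional estimate.

    def next_item(theta_hat, difficulties, administered):
        """Under a Rasch model, item information p*(1-p) is maximized by
        the unadministered item whose difficulty is closest to the
        current ability estimate theta_hat."""
        candidates = [i for i in range(len(difficulties))
                      if i not in administered]
        return min(candidates, key=lambda i: abs(difficulties[i] - theta_hat))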

Using Enhanced Effective Response Time to Detect the Extent and Track the Trend of Item Pre-Knowledge on a Large-Scale Computerized Adaptive Assessment. Jie Li and Xiang Bo Wang, ACT, Inc.

In addition to being highly efficient and accurate in terms of scoring, diagnosis, and reporting, CAT is known for its global ease and reach of test delivery (Wainer et al., 2000; Meijer & Nering, 1999; Parshall, Spray, Kalohn, & Davey, 2002). However, the latter advantage also introduces a tenacious problem: because CAT is administered frequently, items are potentially exposed to a large number of examinees, which is likely to increase advance knowledge (pre-knowledge) of items and to jeopardize score validity. Of great concern and interest to the entire educational testing industry is the possibility of validly detecting and tracking the extent to which CAT items are exposed. The purpose of this research was (1) to establish population item response times and associated trends for all items in a large-scale international CAT assessment and (2) to investigate the feasibility of applying “effective response time” (ERT; Meijer & Sotaridona, 2006) to detect the extent and track the trend of item pre-knowledge on suspected compromised items in this assessment. The study was based on both operational and simulated data from a large item pool of a large-scale international CAT assessment. This item pool was selected because (1) it contained a substantial number of new items that were pretested several years ago, when little or no item pre-knowledge could be assumed, and (2) these pretest items had a long history of operational use in subsequent years, during which item pre-knowledge could have accumulated. ERT indices for both items and examinees, as described by Meijer and Sotaridona (2006), were computed for a large collection of new items at pretest time, after the items had passed stringent pretest quality reviews. The ERT indices from this round were used as null-hypothesis benchmarks, since no serious item pre-knowledge could be assumed. In addition, simulations were conducted to project the values of these ERT indices if examinees’ response times were reduced by one-half and one-fourth, respectively. Examinees’ ability estimates on the operational items of this pool were used for ERT modeling. ERT indices were also computed when the new items were first used operationally, and the results were compared with their pretest counterparts. For further information: Jie.Li@Act.org
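
A simplified stand-in for the ERT idea is sketched below: per-item response time baselines are estimated from pretest data, and operational responses that are implausibly fast relative to that baseline are flagged. The actual ERT statistic of Meijer and Sotaridona (2006) additionally conditions on examinee ability; the threshold and names here are illustrative assumptions.

    import math
    import statistics

    def ert_baseline(pretest_times):
        """Per-item baseline from pretest log response times, collected
        when little or no item pre-knowledge can be assumed."""
        logs = [math.log(t) for t in pretest_times]
        return statistics.mean(logs), statistics.stdev(logs)

    def flag_preknowledge(observed_time, baseline, z_crit=-2.0):
        """Flag an operational response that is implausibly fast
        relative to the item's pretest baseline (z_crit illustrative)."""
        mu, sigma = baseline
        z = (math.log(observed_time) - mu) / sigma
        return z < z_crit

Tracking the proportion of flagged responses per item across administration windows gives the kind of pre-knowledge trend the study set out to monitor.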

Computerized Adaptive Testing for the Singapore Employability Skills System (ESS). Patricia Rickard, CASAS, James B. Olsen, Alpine Testing Solutions, Debalina Ganguli, CASAS, and Richard Ackermann, Team Code, Inc.

This paper presents and demonstrates innovations in computerized adaptive testing of adult workplace literacy and numeracy skills developed by CASAS and customized for the Singapore Employability Skills System (ESS). The Singapore Workforce Development Agency (WDA) plays a pivotal role in the implementation of the ESS “to enhance the employability and competitiveness of employees and job seekers, thereby building a workforce that meets the changing needs of Singapore’s economy.” CASAS has designed and developed CATs for mathematics, reading, and listening, and computer-delivered tests for writing and speaking, suitable for adults. The CATs are administered in secure proctored locations using local area networks and an electronic access key (dongle). This paper presents an overview of the project, demonstrations of sample test items from the test battery, a presentation of the test delivery and administration system, a review of test score results and psychometric analyses, and plans for future enhancements and extensions. The Singapore CATs use the following psychometric procedures: selection of the initial item from a random proficiency value near the center of the proficiency distribution of the selected item bank; Rasch-model calibration and proficiency estimation; and a stopping rule based on a minimum standard error or administration of a specified maximum number of items. Results for the mathematics and reading CATs are presented showing scale-score population distributions, stopping-rule exit criteria, item exposure distributions, and ability estimate and standard error curves across the item administration sequence. The paper presents summary recommendations for enhancements and extensions of the CAT tests and for additional CAT research and validity investigations. The CAT results are based on examinee samples of approximately 12,000 for the reading tests and 9,000 for the numeracy tests. For further information: rickard@casas.org
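
The stopping rule described above can be sketched as follows, assuming a Rasch model; the minimum standard error and maximum test length shown are illustrative placeholders, not the operational ESS settings.

    import math

    def rasch_info(theta, b):
        """Rasch item information: p * (1 - p)."""
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        return p * (1.0 - p)

    def should_stop(theta_hat, administered_b, se_min=0.30, max_items=40):
        """Stop when the standard error of the ability estimate falls
        below a minimum, or when a maximum number of items has been
        administered. se_min and max_items are illustrative values."""
        info = sum(rasch_info(theta_hat, b) for b in administered_b)
        se = 1.0 / math.sqrt(info) if info > 0 else float("inf")
        return se <= se_min or len(administered_b) >= max_items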

Computerized Adaptive Testing in Spain: Description, Item Parameter Updating, and Future Trends of eCAT. Francisco J. Abad, David Aguado, Julio Olea, and Vicente Ponsoda, Universidad Autónoma de Madrid, and Juan Ramón Barrada, Universidad Autónoma de Barcelona

eCAT is a CAT developed and applied in Spain to assess the English proficiency of Spanish speakers. The test was developed by psychometricians from the School of Psychology (Universidad Autónoma de Madrid) and the IIC (Knowledge Engineering Institute). The psychometricians constructed the item bank and designed the adaptive algorithm; the IIC handles the marketing and control of test delivery via the Internet. To date, thousands of tests have been administered in personnel selection processes and for the assessment of undergraduates’ language competences in several Spanish universities. In this presentation we summarize the work done in designing and updating the system. We address four aspects of eCAT: (1) test construction, including item bank design and calibration, the adaptive algorithm, psychometric properties of the θ scores (reliability and validity), computerized reports, and software for web-based administration; (2) main results of the application (a descriptive study of θ scores, estimation errors, execution time, and exposure rates); (3) analysis of parameter drift and its impact on the θ scores, assessed by comparing the parameter estimates from the initial calibration sample with those obtained under ordinary eCAT operation; and (4) work in progress: item parameter updating, increasing the bank size using online calibration procedures, and calibrating a new bank of items to assess English listening proficiency (eCAT-listening). For further information: fjose.abad@uam.es
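
A minimal sketch of the kind of drift comparison described in point (3), assuming item difficulties are available from both calibrations; the flagging threshold is an illustrative assumption, not eCAT's criterion.

    def drift_report(initial, operational, tol=0.5):
        """Compare item difficulty estimates from the initial calibration
        with estimates obtained under ordinary operation. Items whose
        difficulty shifts by more than `tol` logits are flagged for
        review; `tol` is an illustrative threshold."""
        flagged = []
        for item, b0 in initial.items():
            b1 = operational.get(item)
            if b1 is not None and abs(b1 - b0) > tol:
                flagged.append((item, b0, b1))
        return flagged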

Item Selection With Biased-Coin Up-and-Down Designs. Yanyan Sheng, Southern Illinois University at Carbondale

A basic ingredient in computerized adaptive testing (CAT) is the item selection procedure, which sequentially selects and administers items based on a person’s responses to the previously administered items. For decades, maximum information (MI; Lord, 1977; Thissen & Mislevy, 2000) has been the conventional item selection algorithm in CAT. However, this criterion, based on Fisher information, targets only the difficulty level at which a person has about a 0.5 probability of answering items correctly, and hence is not applicable when a different percentile is desired. In addition, MI relies heavily on an accurate trait estimation procedure that works well in all testing situations, yet studies have shown that such a procedure is not readily available. The biased-coin up-and-down design (BCD; Durham & Flournoy, 1994) has been widely used in bioassay for sequential dosage-level selection because it can target any arbitrary percentile and is efficient (Bortot & Giovagnoli, 2005). Because the bioassay problem shares many similarities with CAT, it is reasonable to believe that an item selection algorithm based on the BCD, which does not rely on an accurate trait estimate at every step of the CAT administration, provides an efficient and more flexible alternative to the conventional method. Developing this selection algorithm matters as schools, professional organizations, and private companies seek to make CAT flexible enough for wider testing applications. The purpose of this study was to illustrate the use of the BCD in CAT and to evaluate its utility by comparing it with the conventional MI algorithm. For ease of comparison, the study focused on the one-parameter item response function. To investigate the utility of the BCD in CAT, two Monte Carlo simulation studies were conducted, employing either a fixed- or a random-stopping rule. With the fixed-stopping rule, the number of items administered was manipulated (k = 5, 10, 30, 100) and the item pool was fixed at 100 different difficulty levels, whereas with the random-stopping rule, the number of different difficulty levels in the item pool was manipulated (n = 10, 30, 50, 100). In either case, CAT responses were simulated for persons whose actual trait levels were 0 (average), -1 (1 standard deviation below the average), and -2 (2 standard deviations below the average), with the target difficulty level at the 20th, 50th, or 80th percentile. Each adaptive testing simulation began trait estimation with an initial value of 0 and proceeded with the maximum likelihood method. The results suggest that item selection with the BCD is more flexible in targeting an arbitrary percentile of the difficulty levels. With respect to the accuracy of trait estimation, MI performs slightly better under the fixed-stopping rule, whereas the BCD is considerably better under the random-stopping rule for tests with a small number of difficulty levels or for persons whose trait levels are not at the extremes. For further information: ysheng@siu.edu
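
For concreteness, the up-and-down step can be sketched as below, under the convention that the tracked response is a correct answer and using the standard biased-coin probabilities that make the random walk settle at the difficulty level where the success probability equals the target; the exact formulation in the study may differ.

    import random

    def bcd_next_level(level, correct, gamma, n_levels):
        """Biased-coin up-and-down step (after Durham & Flournoy, 1994),
        targeting the difficulty at which P(correct) = gamma.
        Difficulty levels are indexed 0 (easiest) .. n_levels-1 (hardest)."""
        if gamma >= 0.5:
            if not correct:
                step = -1                                # move easier
            else:
                b = (1.0 - gamma) / gamma                # biased coin
                step = 1 if random.random() < b else 0   # harder or stay
        else:
            if correct:
                step = 1                                 # move harder
            else:
                b = gamma / (1.0 - gamma)
                step = -1 if random.random() < b else 0  # easier or stay
        return min(max(level + step, 0), n_levels - 1)

At equilibrium the walk centers on the level where P(correct) = gamma, which is what allows the design to target the 20th, 50th, or 80th percentile without an accurate trait estimate at every step.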

On-the-Fly Item Calibration in Low-Stakes CAT Procedures. Sharon Klinkenberg, Marthe Straatemeier, and Han van der Maas, University of Amsterdam

We present a new model for computerized adaptive progress monitoring. This model is used in the Math Garden, a web-based monitoring system that provides a challenging web environment in which children practice arithmetic skills. The Math Garden is a CAT web application that tracks both accuracy and response time. Using a new model (Maris, in preparation) based on the Elo (1978) rating system and an explicit scoring rule, estimates of ability level and item difficulty are updated on every trial. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. By integrating response time into the scoring rule, we compensate for the loss of information associated with the high success rates (van der Maas and Wagenmakers, 2005). In a period of eight months, our sample of 1,053 children completed over 850,000 arithmetic problems, about 25% of them outside school hours. Results show good validity and reliability, high pupil satisfaction as measured by playing frequency, and good diagnostic properties. The ability scores correlated highly with the Dutch norm-referenced general math ability scale of the CITO pupil monitoring system. Test-retest reliability analysis also showed high correlations. In view of the satisfactory validity and reliability of the person ability estimators, our method opens the door to on-the-fly item calibration in low-stakes testing. For further information: S.Klinkenberg@uva.nl
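
A minimal sketch of an Elo-style on-the-fly update follows, using a simplified accuracy-plus-speed score; the operational system uses the explicit scoring rule of Maris (in preparation), for which this is only a stand-in, and the K factor shown is an illustrative assumption.

    import math

    def elo_update(theta, b, correct, rt, time_limit, k=0.05):
        """Elo-style update of person ability and item difficulty after
        one response. The score rewards fast correct answers and
        penalizes fast errors; slow responses carry little information,
        a simplified version of the explicit scoring rule cited above."""
        speed = max(0.0, 1.0 - rt / time_limit)          # 1 = instant
        score = 0.5 + 0.5 * speed * (1 if correct else -1)
        expected = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch expectation
        theta_new = theta + k * (score - expected)       # ability update
        b_new = b - k * (score - expected)               # difficulty update
        return theta_new, b_new

Because both the person and the item parameter move a little on every trial, the item bank calibrates itself during ordinary play, which is what makes the approach attractive for low-stakes settings.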

Investigating Cheating Effects on the Conditional Sympson and Hetter Online Procedure with Freeze Control for Testlet-Based Items. Ya-Hui Su, University of California, Berkeley

In CAT, if a group of examinees purposefully memorizes items and distributes them to prospective examinees, the fairness and accuracy of the CAT are ruined. Steffen and Mills (1999) investigated this effect and found that the more items compromised and the more effective the cheating, the more severe the overestimation for the recipients, especially those at low ability levels. Su, Chen, and Wang (2004) pointed out that overestimation for the recipients was more severe when the sources had diverse ability levels, because more items were compromised. Su and Wang (2007) proposed an item exposure control procedure, the conditional Sympson and Hetter (Sympson & Hetter, 1985) online procedure with freeze control (SHCOF). Results showed it to be superior to many conventional procedures in terms of measurement and operational efficiency. To assess the cheating effect, Su and Wang (2008) used SHCOF in a CAT and found that, in a unidimensional context, it could obtain precise person estimates in real time without requiring prior simulations to set item exposure parameters. Little research has investigated cheating effects in a testlet context. Hence, it is of great value to ascertain whether SHCOF is affected by cheating among examinees in a testlet context, compared with a popular procedure such as the Stocking and Lewis conditional multinomial method (SLC; Stocking & Lewis, 1998). The goal of this study was to use simulations to investigate how these two item exposure control procedures perform under various cheating conditions. It was hypothesized that SHCOF would be less affected by cheating than SLC. Four independent variables were manipulated: (1) ability level of the sources, (2) ability distribution of the recipients, (3) cheating condition (no cheating, inefficient cheating, efficient cheating, and perfect cheating), and (4) item exposure control procedure (SHCOF and SLC). The root mean squared error (RMSE) was computed to describe the cheating effects; the more serious the cheating effect, the larger the RMSE. Under the no-cheating condition, there was no significant difference in RMSE between SHCOF and SLC. SLC showed more serious inflation of RMSE than SHCOF under the perfect-cheating condition, and as the cheating conditions grew more severe, the overestimation for the recipients grew more severe when SLC was used. In addition, the more diverse the ability of the sources, the larger the RMSE and the mean positive bias. More importantly, SHCOF yielded smaller RMSE than SLC because only SHCOF simultaneously monitors item exposure and test overlap rates online. SHCOF can obtain precise person estimates without requiring simulations to set item exposure parameters before use in an operational CAT. If test items are memorized by sources and shared with recipients, CAT becomes unfair because the ability levels of the recipients are overestimated. In this study, SHCOF was less affected by cheating than SLC; hence, the SHCOF procedure can be safely implemented in operational CAT. For further information: yahuisu@berkeley.edu
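
As background, the basic Sympson-Hetter administration filter can be sketched as follows. SHCOF's contribution, not shown here, is to update (and, with freeze control, stabilize) the exposure parameters online during operation rather than fixing them through prior simulation; the names and fallback rule below are illustrative.

    import random

    def administer_with_sh(ranked_items, exposure_params):
        """Sympson-Hetter exposure filter: the selected item is actually
        administered only with probability P(A|S); otherwise the
        next-best item is tried. ranked_items is best-first by
        information; exposure_params maps item -> P(A|S)."""
        for item in ranked_items:
            if random.random() < exposure_params[item]:
                return item
        return ranked_items[-1]   # fall back to the last candidate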

The CAT-DI Project: Development of a Comprehensive CAT-Based Instrument for Measuring Depression. Robert D. Gibbons, University of Illinois at Chicago

The combination of IRT and CAT has proven invaluable in educational measurement. More recently, enormous reductions in patient and physician burden have been demonstrated using IRT-based CAT for mental health measurement problems (Gibbons et al., 2008). CAT administration of a 626-item mood and anxiety spectrum disorder inventory revealed that an average of 24 items per examinee was required to provide impairment estimates correlating 0.93 with the original complete scale. Furthermore, despite an 83% reduction in the average number of items administered, the CAT-based mood disorder subscale scores showed twice the effect size of the total scale score in differentiating patients with bipolar disorder. These preliminary findings led to further interest and funding from the National Institute of Mental Health to develop a CAT-based instrument for screening for major depressive disorder (the CAT Depression Inventory, CAT-DI) that can be used for routine depression screening in general medical practice settings as well as specialty mental health clinics. A recent supplement to the parent CAT-DI grant extends our work on CAT for mental health measurement to CAT for diagnostic assessment of depression and other psychiatric disorders. The CAT Major Depressive Disorder (CAT-MDD) project will explore four different statistical/psychometric models for estimating the probability of an underlying discrete major depressive disorder from adaptively administered, self-reported symptom ratings. The ultimate objective of this program of research is to reduce patient and physician burden in screening for and diagnosing depression in general practice settings. Potential benefits include reduced health care costs produced by high rates of service utilization among patients with undiagnosed depressive illness, increased detection of depressive disorders, and increased access to quality mental health care for patients in need of such services. For further information: rdgib@uic.edu

Development of a CAT to Measure Dimensions of Personality Disorder: The CAT-PD Project. Leonard J. Simms, University at Buffalo

This presentation describes the CAT-PD project, a funded, multi-year study designed to develop an integrative and comprehensive model and measure of personality disorder trait dimensions. Our general aims are to (1) identify a comprehensive and integrative set of dimensions relevant to personality pathology, and (2) develop an efficient CAT method, the CAT-PD, to measure these dimensions. To accomplish these goals, we plan a five-phase project to develop and validate the model and measure. The presentation describes the project generally, the results of Phase I (which focused on content domains and initial item bank development), and our plans for IRT/CAT with these item banks. In particular, I will focus on how the item banks will be used, the IRT models we are considering for item bank calibration, the CAT algorithms we plan to test, and our methods for deciding on a final set of procedures for the completed CAT-PD measure. Finally, I will discuss the CAT and IRT challenges that we anticipate facing in the future. For further information: ljsimms@buffalo.edu

Multidimensional Adaptive Personality Assessment: A Real-Data Confirmation. Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology

Although CAT was developed in the context of ability tests (Weiss, 1982), studies have since demonstrated the effectiveness of CAT for measuring attitudes and personality. For example, Koch, Dodd, and Fitzpatrick (1990) applied the rating scale model to a Likert-scale attitudinal questionnaire. The rating scale model (an extension of the one-parameter logistic model for polytomous data) was found to fit the data very well and, although item pool issues were noted, measurement was effective. Other studies have found similar results for personality assessments, suggesting that perhaps half the items of an assessment are needed to achieve comparable reliabilities (Waller & Reise, 1989; Reise & Henson, 2000). However, one issue that has not been extensively treated in the prior literature is the multidimensional nature of most personality assessments; prior research has generally applied unidimensional CAT to individual scales. Segall (1996) presented a multidimensional CAT (MCAT) methodology in which correlations between the factors are leveraged to administer and score items even more efficiently. Mead, Segall, Williams, and Levine (1997) described a Monte Carlo simulation of adaptive administration of the 16PF Questionnaire (Cattell, Cattell, & Cattell, 1993; Conn & Rieke, 1994) using Segall’s MCAT method. As in Segall’s simulation, the MCAT method allowed reductions in assessment length beyond those typically achieved with unidimensional CAT; for example, overall assessment length could easily be cut in half with small decrements in scale reliabilities. The purpose of the current study was to extend the Monte Carlo results (Mead et al., 1997) to real data. This study is important for two reasons. First, it is always important to show that simulated results generalize to actual use. Even more importantly, recent personality research (research that specifically included the 16PF; Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001) has suggested that traditional IRT models do not fit personality data well and might not be the most appropriate models (Stark, Chernyshenko, Drasgow, & Williams, 2006). If the IRT model fits 16PF data poorly, the Monte Carlo results will not hold for real data; if the real-data results replicate the simulation results, we may assume that traditional IRT models fit 16PF data sufficiently well. We obtained archival data from administrations of the 16PF Questionnaire to approximately 5,000 individuals, and the two-parameter logistic model was fit to the items using BILOG-MG 3.0. Segall’s (1996) software was adapted to read the individuals’ actual responses for a real-data simulation. Results generally supported the use of MCAT with 16PF items. Correlations between actual 16PF scores and MCAT trait estimates were high (averaging .91 to .82) for MCAT tests shortened by up to 40–50%, while shorter MCAT tests had moderate correlations (averaging .72 to .58). The presentation will also discuss pool usage (about a third of the pool had exposure rates greater than 90%), efficiency for individuals with extreme scores, and practical considerations for adaptive personality assessment. For further information: jsergent@iit.edu
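
A sketch of Segall (1996)-style item selection appears below: the multivariate normal prior's precision matrix carries the correlations between traits, and the candidate item that maximizes the determinant of the posterior information matrix is chosen. The data structures are illustrative assumptions, not Segall's software.

    import numpy as np

    def mcat_select(prior_cov_inv, info_so_far, candidate_infos):
        """Choose the item whose Fisher information matrix maximizes the
        determinant of the posterior information (prior precision plus
        accumulated item information). candidate_infos maps item_id to a
        p x p information matrix evaluated at the current trait estimate."""
        posterior = prior_cov_inv + info_so_far
        best_item, best_det = None, -np.inf
        for item, info in candidate_infos.items():
            det = np.linalg.det(posterior + info)
            if det > best_det:
                best_item, best_det = item, det
        return best_item

Because correlated traits share information through the prior, each response sharpens the posterior on several dimensions at once, which is the source of the extra length reductions reported above.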