There are a number of decisions to be made in implementing an adaptive test. The following is a brief discussion of several of these issues. These comments reflect only one viewpoint; others will likely see these issues differently. If you would like to express an opinion on them, please start a discussion in our forum so that we can build a productive dialogue that will be useful to others.
Much of the research and development in CAT has been done in the context of achievement testing. Although some achievement domains are both unidimensional and relatively homogeneous – that is, they measure a single variable without substantial variation in content – some are relatively unidimensional but include two or more content domains. An example is arithmetic achievement at the elementary school level. The basic arithmetic operations (addition, subtraction, multiplication, and division) can be scaled on a single difficulty continuum, but they represent distinct operations for assessment purposes.
Because these operations can be scaled on a single difficulty scale, IRT procedures could be used to create an item bank for arithmetic achievement for use in a CAT. However, the difficulty differences among these operations would result in CATs that had different weightings of these operations across different examinees – high ability students would tend to get mostly division items and low ability students would receive mostly addition items. Thus, although all students would be measured on the same achievement scale, the content of their tests would differ across the four operations.
Several procedures have been proposed to achieve “content balance” among examinees in domains of this type (e.g., Kingsbury & Zara, 1991; see bibliography for additional references). These procedures modify the maximum information item selection procedure by also considering the content category of the items in the item selection process. Once an item is selected by maximum information at the examinee’s current theta estimate, its content classification is examined relative to target values set up in advance for each examinee. If the selected item represents a content area that is underrepresented at that stage in the examinee’s test, the item is administered. If not, the item that provides the next highest information is examined relative to the content targets, and the process is repeated until an item from an appropriately underrepresented content area is identified.
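A minimal sketch of this constrained selection rule is given below, assuming a 3PL item bank. The field names (`a`, `b`, `c`, `area`), the target structure, and the fallback behavior are illustrative assumptions, not part of the published procedures.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability level theta."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def select_balanced_item(theta, bank, administered, counts, targets):
    """Walk the bank in descending-information order and administer the
    first unused item whose content area is still below its target
    proportion for this examinee (content-balancing sketch)."""
    info = [item_information(theta, it['a'], it['b'], it['c']) for it in bank]
    order = np.argsort(info)[::-1]                 # most informative first
    total = max(1, sum(counts.values()))           # items given so far
    for idx in order:
        if idx in administered:
            continue
        area = bank[idx]['area']
        if counts.get(area, 0) / total < targets[area]:
            return int(idx)
    for idx in order:                              # all areas at target:
        if idx not in administered:                # fall back to best item
            return int(idx)
```

Note that the unconstrained maximum information rule is recovered whenever every content area is below target, which is why content balancing costs efficiency only when it actually overrides the information ranking.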
Clearly, by modifying the maximum information item selection procedure, content balancing reduces the efficiency of CATs. The result will be tests that are longer than they would otherwise need to be to achieve the same measurement objectives. In the context of assessment for counseling purposes, content balancing is likely to be an issue only for CAT-based achievement tests. Measures of ability, personality, and preferences usually are both unidimensional and homogeneous in content and, therefore, will not require content balancing. See multiple scales for an alternative approach to CAT when there are multiple content areas in a test.
Kingsbury, G. G., & Zara, A. R. (1991). A comparison of procedures for content-sensitive item selection in computerized adaptive tests. Applied Measurement in Education, 4, 241-261.
As CATs are administered to groups of examinees, different items are taken by different individuals. Depending on the relationship between the trait distribution of the examinees and the information structure of the item bank, different items will be used (or “exposed”) at differing rates. A number of procedures have been proposed to control for item exposure by not administering a selected item to an examinee if the probability of over-exposure is high (e.g., Hetter & Sympson, 1997) or by modifying maximum information item selection to allow selection of items that are otherwise unlikely to be administered (e.g., Revuelta & Ponsoda, 1998). These procedures function similarly to content balancing procedures: they constrain maximum information item selection so as to control each item’s probable exposure rate across a group of examinees.
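The probabilistic filter used by Sympson-Hetter-type procedures can be sketched as follows. The list/dict layout and the fallback rule are illustrative assumptions; in practice the exposure parameters `k` are calibrated in advance through iterative simulation against the intended examinee population.

```python
import random

def select_with_exposure_control(ranked_items, k, rng=random):
    """Walk down the information-ranked candidate list and administer
    item i with probability k[i], its exposure-control parameter.
    Items that fail the probability check are passed over for this
    examinee, which caps their long-run exposure rate."""
    for idx in ranked_items:
        if rng.random() < k[idx]:
            return idx
    return ranked_items[-1]   # fallback: administer the last candidate
```

An item with k = 1.0 is always administered when selected; lowering k toward zero trades measurement efficiency for reduced exposure of that item.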
Item exposure can be problematic in large testing programs or in some school settings when test scores are used to make decisions or judgments about examinees. In these cases when there is an incentive for examinees to have access to items in the CAT item bank, by reducing the frequency of exposure of an item the likelihood of an examinee having prior access to the item (and the correct answer) is reduced.
Similar to content balancing, item exposure controls impose constraints on maximum information item selection and reduce the efficiency of CATs. Consequently, their use will result in longer tests than otherwise would be required. When tests are used for clinical, counseling, or research purposes, however, there is rarely incentive for examinees to improve their scores from prior knowledge of the correct or keyed answers to items that are likely to appear in a CAT. Hence, item exposure controls are unlikely to be necessary in CATs used for these purposes and unconstrained CATs can be used for maximally efficient and effective measurement.
Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington DC: American Psychological Association.
Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 35, 311-327.
See bibliography for a rapidly growing body of literature on item exposure.
Closely related to the issue of content balancing is the application of CAT to instruments that measure an examinee on multiple scales. Such instruments include ability test batteries (such as the Armed Services Vocational Aptitude Battery or the Differential Aptitude Tests), multiple-scale personality inventories that are comprised of unidimensional scales, and attitude and preference instruments that measure multiple homogeneous variables. For these types of instruments, “content balance” is achieved by treating each scale as a separate unidimensional variable and obtaining IRT parameters for CAT separately for each scale. Then, CAT can proceed separately for each scale to measure each examinee as well as possible on each scale. The result is, typically, a profile of scores for each examinee that can be used for applied purposes.
When used for this type of measurement objective, CAT will provide highly precise and efficient measurements separately for each scale. However, the process of measuring an individual on multiple scales can be made even more efficient by an extension of the CAT procedure to the multiple-scale measurement problem.
Most test batteries or instruments with multiple scales result in scores that are intercorrelated to some degree. Ability test scores tend to correlate in the .30 to .50 range, and scales on personality and preference instruments can have higher or lower intercorrelations depending on the nature of the variables being measured. Because CAT can use differential starting values for beginning a test, these intercorrelations can be used to set the starting values for every test after the first in a multiple-test application. Brown and Weiss (1977) proposed that the scale intercorrelations among a set of tests be computed using theta estimates from a test development group. The pair of scales with the highest intercorrelation is chosen first. Then the multiple regression of each remaining scale on those two scales is computed, and the scale that is best predicted from them is added to the first two. The process is repeated with three scales, then four, and so on. Based on the multiple correlations obtained as each new scale is added, the subtests can be ordered by how well they can be predicted from the other subtests. Finally, the multiple regression equations can be used to predict an examinee’s theta on each new test in the battery, and that prediction serves as the starting value for the test.
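The prediction step of this inter-subtest branching can be sketched with an ordinary least-squares fit. The function and variable names here are ours, and the subtest-ordering step described above is omitted for brevity.

```python
import numpy as np

def predicted_start_theta(norm_predictors, norm_target, examinee_thetas):
    """Fit the multiple regression of a later subtest's theta on the
    subtests already administered, using theta estimates from a norming
    sample, then predict an examinee's starting theta from his or her
    thetas on those completed subtests."""
    # design matrix with an intercept column
    X = np.column_stack([np.ones(len(norm_predictors)), norm_predictors])
    coef, *_ = np.linalg.lstsq(X, norm_target, rcond=None)
    return float(coef[0] + coef[1:] @ np.asarray(examinee_thetas))
```

Starting a subtest near the examinee's likely theta means the first few items are already informative, which is the source of the item savings reported below.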
This “inter-subtest” branching further enhances the efficiency of the separate CATs for each subtest by providing increasingly accurate starting values for each subsequent test in the battery. The result is further – and sometimes dramatic – reductions in the numbers of items needed to measure an examinee on multiple correlated traits. In the context of an achievement test battery, research (Brown & Weiss, 1977; Gialluca & Weiss, 1979; Maurelli & Weiss, 1981) demonstrated that test lengths for later tests in a battery could be reduced by 80% or more relative to their full-length versions with no reduction in measurement precision.
Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test batteries (Research Rep. No. 77-6). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.
Gialluca, K. A., & Weiss, D. J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement (Research Report 79-6). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.
Maurelli, V. A., & Weiss, D. J. (1981). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries (Research Rep. No. 81-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.
Some operational CAT programs are terminated by administering a fixed number of items or by imposing a time limit. Both of these termination procedures are used for the convenience of the test administrator, and neither is grounded in good CAT practice. A test that is terminated for either of these reasons does not allow the CAT to continue until a CAT-based termination criterion is satisfied. If the CAT termination criterion is a specified maximum SEM, a prematurely terminated CAT will not provide equiprecise measurement, because the SEM does not decrease at the same rate for all examinees. Similarly, a CAT designed for equally confident classifications will, if terminated early, result in classifications of lower quality for some examinees. To obtain the maximum benefits of CAT, neither time limits nor a fixed test length should be imposed.
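A CAT loop terminated by a measurement criterion rather than by length or time might look like the following skeleton. All four callables are placeholders for bank-specific IRT routines, and the 0.30 SEM target is purely illustrative.

```python
def run_cat(select_item, administer, update_theta, sem, max_sem=0.30):
    """Skeleton of a CAT terminated by a target standard error of
    measurement: keep selecting and administering items until the
    SEM of the current theta estimate falls below max_sem (or the
    bank is exhausted), so that all examinees are measured to the
    same precision rather than with the same number of items."""
    theta, responses = 0.0, []
    while True:
        item = select_item(theta, responses)
        if item is None:                 # bank exhausted
            break
        responses.append((item, administer(item)))
        theta = update_theta(responses)
        if sem(theta, responses) <= max_sem:
            break                        # equiprecise stopping rule
    return theta, responses
```

Because the SEM shrinks at different rates for different examinees, this loop produces tests of varying length; that variation is the point, not a defect.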
With the emergence of the World Wide Web in the last decade, many tests of ability, personality, and preferences have been modified for delivery on the Web. Typically, a number of items are downloaded as a scrollable “page,” the examinee answers the questions, then returns the completed page through the Web. A long test might deliver several such pages.
Web delivery of tests appears to be a logical next step from the earlier conversion of tests from paper-and-pencil to delivery by personal computers (PCs) that began in the 1980s. The PC testing movement, however, was supported by a large body of research demonstrating that, for the most part, PC administration had no effect on test standardization (e.g., Mead & Drasgow, 1993), with the exception of tests that were primarily speed tests.
There has been almost no research to support the conversion of most tests to Web-based delivery, which can be quite different from PC-based delivery. In PC-based test delivery, the administration process is carefully standardized by software that delivers a test in exactly the same way to each examinee. When tests are delivered by Web browsers, the variety of browsers and browser settings can potentially wreak havoc with test standardization, thus potentially invalidating test results. Without research demonstrating the equivalence of Web-delivered tests, there is a great potential risk that the mode of administration might adversely affect the standardization of the instrument and impact the accuracy, validity, and utility of test scores.
The potential lack of standardization is even more likely to occur if a CAT is delivered over the Web. Because each item in an IRT-based CAT is selected based on the examinee’s scored answers to all previous items, computations must be implemented after each item response is received to select the next item, and the item bank must be available to deliver that item. It might, therefore, be tempting to deliver an item over the Web, send the answer back to the server for scoring, maximum likelihood estimation, and selection of the next item based on item information, then transmit the selected item to the examinee through the Web. This process would then need to be repeated for every item.
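One server round trip in this item-by-item scheme might be sketched as follows; the state layout and the `estimate_theta`/`select_next` stand-ins are hypothetical, not a real Web API.

```python
def handle_item_response(state, item_id, response, bank,
                         estimate_theta, select_next):
    """One request/response cycle in item-by-item Web CAT: record the
    incoming answer, update the theta estimate (e.g., by maximum
    likelihood), select the next item by information, and return it
    to be transmitted back to the examinee's browser. This whole
    cycle repeats once per item, so every item transition incurs a
    full network round trip."""
    state['responses'].append((item_id, response))
    state['theta'] = estimate_theta(state['responses'])
    return select_next(state['theta'], state['responses'], bank)
```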
In addition to the Web delivery problems of conventional tests, this process would introduce an additional source of potentially negative influence on test scores – Web response time. Although sometimes the Web responds quite quickly, there are other times when waits of several seconds or more are evident. Such response times were typical in the 1960s and 1970s when electronic test delivery was attempted on time-shared computers. In most cases the between-item delays were unacceptable and interfered with the standardization of the test-taking process, and time-shared delivery of standardized tests was generally abandoned until PCs eliminated the time-shared delays. Item-by-item delivery of CATs through the Web would likely be a return to this approach of extremely unstandardized test delivery, thereby further compromising the utility and validity of test scores.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449-458.
Non-Cognitive Assessment and the World Wide Web
The following excerpt is from
Butcher, J. N., Perry, J., & Hahn, J. (2004). Computers in clinical assessment: Historical developments, present status, and future challenges. Journal of Clinical Psychology, 60, 331–345.
Internet-Based Test Applications
The growth of the Internet and its broadening commercial applications have increased the potential to administer, score, and interpret psychological tests online. At present, there is a great deal of interest in expanding the availability of psychological assessment services through the Internet. A casual surfing of the Web finds a wide variety of test applications including tests for entertainment purposes, proposed self-help, marketing of commercial tests, and research (Buchanan & Smith, 1999b). Many of the available applications are not “professional” in scope and do not adhere to the American Psychological Association’s psychological testing guidelines.
There are a number of problems associated with the application of psychological tests on the Internet that need to be addressed by psychologists before the Internet can become a major medium for psychological service delivery (see discussions by Buchanan, 2002; McKenna & Bargh, 2000). Issues that need to be addressed before wider clinical use of psychological tests on the Internet are discussed next.
Assurances of Equivalent Test-Taking Attitudes
There is some indication that data collected from personality inventory research through the Internet may differ substantially from paper-and-pencil administration. Pasveer and Ellard (1998) found that administering tests online uncovered problems that need to be addressed (e.g., multiple submissions of the record; one participant in their study submitted their test forms ten times). There needs to be assurance that the individual taking the test has approached the task with the cooperative response attitudes present in the normative sample. Some studies (Davis, 1999) have reported comparable findings; however, response sets for Internet-administered tests versus those of standard administration have not been sufficiently studied. Some research has suggested that different response environments might produce different test results (Buchanan & Smith, 1999a). Thus, it would be important to assure that such test-administration strategies would not produce results different from standard administration. Although, as noted earlier, Internet administered versus booklet-administered tests have not been widely studied, Buchanan and Smith (1999b) noted that equivalence between the two administration formats cannot be assured. If Internet administration involves procedures that are different from typical computer administration, then these conditions also should be evaluated.
Assurances That Test Norms Are Appropriate for an Internet Application
Many psychological tests have standard norms with which to compare clients’ performance as key to the interpretation process. For example, many tests have norms that are based upon a sample of individuals taking the test under carefully controlled and monitored conditions. The application of test norms usually requires that the individuals being compared were tested under comparable conditions. Relatively few traditional psychological tests have been developed through the Internet. One exception was the Dutch-language version of the MMPI-2, which was standardized through an Internet data-collection program (Sloore, Derksen, de Mey, & Hellenbosch, 1996). However, the most widely used assessment instruments have not been normed online, and Internet administration of these instruments may not be comparable to standard administration. Changing the administration procedures can make the standard test norms inappropriate. There are several studies suggesting that tests administered on the Internet produce different results than those administered under standard conditions (Buchanan & Smith, 1999b). Consequently, making traditional tests available to clients through the Internet would represent a very different test-administration environment than that for the original test.
Assurances of Test Validity
As with any psychological test, the Internet version of the test needs to have demonstrated reliability and validity. One cannot be assured that a particular test administered on the Internet produces the same results (i.e., measures the same constructs) as it does under paper-and-pencil conditions. It is important to assure that the test correlates are comparable to those on which the test was originally developed. Some evidence has suggested that Internet-administered tests show construct validity that is comparable to traditional administration (Buchanan & Smith, 1999b).
Test Security
The question of test security has several facets. Most psychologists are aware that test items are considered “protected” items and should not be made public, to prevent the test from being compromised. In fact, psychologists are ethically obliged to assure that test items are secure and not made available to the public (American Psychological Association, 2002). Making test items available to the general public would undermine the value of the test for making important decisions. The security of information placed on the Internet is questionable; there have been numerous situations in which “hackers” have gotten into the highly secured files of banks, the State Department, and so forth. It is important to ensure test security before items are made available through the Internet.
Some psychological tests are considered to require higher levels of expertise and training to interpret and are not made available to psychologists without clear qualifications. Many psychological tests, particularly those involved in clinical assessment, are tests that require careful evaluation of user qualifications. Wide availability of tests on the Internet would likely result in nonqualified test users gaining access to the test.
An additional consideration involves test ownership. Most psychological tests are copyrighted and cannot be copied without permission. Making test items available through the Internet increases the likelihood that copyright infringement will occur.
Psychological test distributors need to develop procedures to assure that the problems noted here do not occur. There are, of course, ways of controlling access to test materials in a manner similar to the way they are controlled in normal clinical practice. For example, even an Internet administration of tests could be secured; that is, be made available only to practitioners who would administer them in controlled office settings. The item responses then could be sent to the test-scoring/interpreting service through the Internet for processing. The results of the testing could be returned to the practitioner electronically in a coded manner that would not be accessible to nonauthorized persons. It is possible that tests, though processed through the Internet, could still be administered and controlled through individual clinicians. It also is possible that the problems described here could be resolved by limiting access to the test in much the same way that credit card numbers are currently protected.
The growth of the Internet and broadening commercial uses have increased the potential to administer, score, and interpret psychological tests online. Commercial test publishers have been receiving a great deal of pressure from test users to make more test-based services available on the Internet. The ethics of psychological test usage and standards of care as well as basic psychological test research have not kept up with the growth spurt of the Internet itself. Consequently, there are many unanswered questions as we move into the 21st century with the almost limitless potential of test applications facing the field.
American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. American Psychologist, 57, 1060–1073.
Buchanan, T. (2002). Online assessment: Desirable or dangerous? Professional Psychology: Research and Practice, 33, 148–154.
Buchanan, T., & Smith, J.L. (1999a). Research on the Internet: Validation of a World Wide Web mediated personality scale. Behavior Research Methods, Instruments, & Computers, 31, 565–571.
Buchanan, T., & Smith, J.L. (1999b). Using the Internet for psychological research: Personality testing on the World Wide Web. British Journal of Psychology, 90, 125–144.
Davis, R.N. (1999). Web-based administration of a personality questionnaire: Comparison with traditional methods. Behavior Research Methods, Instruments, & Computers, 31, 572–577.
McKenna, K.Y.A., & Bargh, J.A. (2000). Plan 9 from cyberspace: The implications of the Internet for personality and social psychology. Personality and Social Psychology Review, 4, 57–75.
Pasveer, K.A., & Ellard, J.H. (1998). The making of a personality inventory: Help from the WWW. Behavior Research Methods, Instruments, & Computers, 30, 309–313.
Sloore, H., Derksen, J., de Mey, H., & Hellenbosch, G. (1996). Adaptions in Europe. In J.N. Butcher (Ed.), International adaptations of the MMPI-2: Research and clinical applications (pp. 329–460). Minneapolis: University of Minnesota Press.