Lertap 5 documents series.
Some CTT & IRT Comments
Larry R Nelson
Last updated: 25 September 2017
(Click here to branch to www.lertap.com)

This page has been published on the Lertap 5 website; follow the hyperlink provided above if you wish to visit the Lertap site. The references cited in the comments below are listed on the References page of the Lertap website.

Lertap 5 is a classical item and test analysis package. It incorporates some aspects of generalisability theory and has considerable support for users of IRT (Item Response Theory), but its true pedigree is CTT, classical test theory.

CTT has been with us for many, many decades. As Hambleton and Jones (1993) stated: “The major advantages of CTT are its relatively weak theoretical assumptions, which make CTT easy to apply in many testing situations.” In writing about CTT and the true-score theory at its heart, Hattie et al. (1999) wrote: “This measurement model has been the workhorse of educational and psychological measurement for much of this century. The model is simple, elegant, and surprisingly durable”.

But CTT is not without critics. For quite some time, a primary criticism related to what was thought to be the instability of the item and person statistics produced by CTT. For years it was believed that the item statistics derived in CTT, such as item difficulty and discrimination, were dependent on the sample of respondents selected to answer the items. Give the same items to a different sample, and the difficulty and discrimination indices of CTT might vary substantially, or so it was thought. Similarly, in CTT the scores earned by test takers depend on the items they’ve been asked to answer; give them another set of items, and their test scores might not even be close to the same, or, at least, so it was widely thought.
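To make the two statistics concrete, here is a minimal sketch of how item difficulty and discrimination are commonly computed in CTT: difficulty as the proportion of test takers answering the item correctly, and discrimination as the correlation between the item score and the total test score. The function names and the tiny data set are hypothetical, made up for illustration; this is not Lertap's own code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def item_statistics(responses):
    """responses: one row per test taker, each row a list of 0/1 item scores.

    Returns a list of (difficulty, discrimination) pairs, one per item.
    Discrimination here is the uncorrected point-biserial with the total score.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        difficulty = sum(item) / len(item)      # proportion correct
        discrimination = pearson(item, totals)  # item-total correlation
        stats.append((difficulty, discrimination))
    return stats

# Hypothetical data: six test takers, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

all_stats = item_statistics(responses)

# The same items, analysed only in the higher-scoring test takers.
top_scorers = [row for row in responses if sum(row) >= 3]
top_stats = item_statistics(top_scorers)

for j, ((p_all, r_all), (p_top, r_top)) in enumerate(zip(all_stats, top_stats)):
    print(f"item {j}: p(all)={p_all:.2f}  p(top)={p_top:.2f}  r(all)={r_all:.2f}")
```

Running the same items through a more able subsample changes the proportions correct, which is the kind of sample dependence the criticism refers to. How much it matters in samples of realistic size is what the studies discussed below set out to examine.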

These matters of perceived CTT instability are often cited as a principal reason behind the emergence of IRT, Item Response Theory. It might now be said that IRT has become a virtual industry unto itself; witness the number of books and software systems now available for IRT (there are numerous IRT texts available from Amazon.com, to name just one source).

It is presently common to see IRT referred to as the “modern” method of item analysis, with the obvious implication being that CTT is not modern.

But reputation should not be confounded with age. Not modern does not mean not useful. Not modern does not mean no longer appropriate. Indeed, not modern does not necessarily mean not best.

Before pointing out who uses CTT, and a system such as Lertap 5, let’s look at some relevant research that has compared the results obtained from IRT analyses with those resulting from CTT. It turns out that the methods converge much, much more than one would expect, given their radically different models.

First, look at Lawson (1991). He compared CTT with IRT’s Rasch model, looking at three different tests in three different samples. The title of his article gives away his findings: “One parameter latent trait measurements: Do the results justify the efforts?”

Lawson found “…remarkable similarities between the results obtained through classical measurement methods and those results obtained through one-parameter latent trait methods. Both procedures yield almost identical information regarding both person abilities and item difficulties…”. Nelson (2008) published a paper which also discusses the Rasch one-parameter IRT model, and some of the potentially misleading claims associated with it.

Then look at Fan (1998). He also compared the two methods, but used larger samples, and compared CTT with all three IRT models (one-, two-, and three-parameter), not just the single-parameter Rasch model.

Fan wrote that “The findings indicate that the person and item statistics derived from the two measurement frameworks are quite comparable. The degree of invariance of item statistics across samples, usually considered as the theoretical superiority (of) IRT models, also appeared to be similar for the two measurement frameworks.”

Researchers in Sweden, working on national achievement tests, have undertaken work similar to Fan’s, comparing CTT and IRT within very large samples. Stage (1998) has summarised many of the results of the Swedish research, which has been extensive. Like Lawson and Fan, she reported that the conclusion from these studies “… was that the results were very similar in spite of the differences between the theoretical frameworks”.

In a later report, Stage (2003), writing with regard to the development of the Swedish Scholastic Aptitude Test, SweSAT, stated: “In the studies reported in this paper, the CTT indices were not only comparable to the IRT parameters, they were generally more invariant between different samples of test takers. One possible explanation of these results is that the IRT model did not fit the test data. But even if the results are due to poor model fit, the only reasonable conclusion is that for SweSAT data, CTT seems to work better than IRT” (p. 25).

Comparing CTT and IRT in the development of an educational achievement test in Spain, Gil Escudero et al. (1999) reported that, despite the theoretical “superiority” of IRT, “…ambas teorías no se diferencian a la hora de asignar puntaciones de rendimiento a los alumnos” (the two theories cannot be distinguished when it comes to assigning achievement scores to students).

MacDonald and Paunonen (2002) used Monte Carlo methods to continue the comparison of IRT and CTT properties. They reported findings suggesting “…IRT- and CTT-based item difficulty and person ability estimates were highly comparable, invariant, and accurate in the test conditions simulated. However, whereas item discrimination estimates based on IRT were accurate across most of the experimental conditions, CTT-based item discrimination estimates proved accurate under some conditions only”. A study of this article may suggest to some, in turn, that different results might have been obtained had the authors included different indices of CTT item discrimination. An uncorrected point-biserial coefficient was used in the study; results would almost certainly have differed had a corrected biserial coefficient been employed.
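The distinction between these discrimination indices can be sketched in a few lines. The example below assumes that “corrected” means the usual part-whole correction, in which the item is removed from the criterion score before correlating, and it uses the standard textbook conversion from a point-biserial to a biserial coefficient. The function names and data are hypothetical, and the sketch is not taken from MacDonald and Paunonen or from Lertap's code.

```python
from math import sqrt
from statistics import NormalDist

def pearson(x, y):
    """Plain Pearson correlation (assumes both variables have some variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(responses, j, corrected=False):
    """Correlation of item j with the test score.

    With corrected=True the item's own score is removed from the criterion
    (assumed meaning of "corrected": the part-whole correction).
    """
    item = [row[j] for row in responses]
    criterion = [sum(row) - (row[j] if corrected else 0) for row in responses]
    return pearson(item, criterion)

def biserial_from_point_biserial(r_pb, p):
    """Standard conversion: r_bis = r_pb * sqrt(p * q) / y.

    p is the proportion correct, q = 1 - p, and y is the height of the
    standard normal density at the point that divides the area into p and q.
    """
    nd = NormalDist()
    y = nd.pdf(nd.inv_cdf(p))
    return r_pb * sqrt(p * (1 - p)) / y

# Hypothetical data: six test takers, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

j = 1
p = sum(row[j] for row in responses) / len(responses)
r_uncorrected = point_biserial(responses, j)
r_corrected = point_biserial(responses, j, corrected=True)
r_bis_corrected = biserial_from_point_biserial(r_corrected, p)

print(f"uncorrected point-biserial: {r_uncorrected:.3f}")
print(f"corrected point-biserial:   {r_corrected:.3f}")
print(f"corrected biserial:         {r_bis_corrected:.3f}")
```

On a short test the uncorrected coefficient is noticeably inflated, because the item contributes to its own criterion. The biserial, in turn, is larger in magnitude than the point-biserial it is derived from. The choice of index can therefore shift the picture of CTT discrimination considerably.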

In 2003, a doctoral student at the University of Alberta, Teresa Dawber, began to investigate another aspect of the relationship between CTT and IRT. She looked into the accuracy of two formulas from Lord (1968) which provide a means of deriving estimates of IRT parameters given common CTT item statistics. Dawber has found that, under certain conditions, the IRT parameter estimates resulting from the application of the Lord formulas may be sufficiently precise. For further reference, please see Experimental Features in Lertap 5.
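The formulas themselves are not reproduced on this page. The sketch below assumes they are the approximations usually attributed to Lord for the normal-ogive model with normally distributed ability: discrimination a = r / sqrt(1 - r²), and difficulty b = γ / r, where r is the item's biserial correlation with the test score and γ is the normal deviate that cuts off the proportion correct, p, in the upper tail. The function name is hypothetical; treat this as an illustration under those assumptions, not as the procedure Dawber studied or the one implemented in Lertap 5.

```python
from math import sqrt
from statistics import NormalDist

def approximate_irt_parameters(p, r_bis):
    """Approximate normal-ogive IRT parameters from CTT item statistics.

    p     : proportion of test takers answering the item correctly (0 < p < 1)
    r_bis : biserial correlation between the item and the test score
            (non-zero, with absolute value below 1)

    Assumed formulas (commonly attributed to Lord):
        a = r_bis / sqrt(1 - r_bis**2)
        b = gamma / r_bis,  where gamma = -inverse_normal_cdf(p)
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if r_bis == 0 or abs(r_bis) >= 1:
        raise ValueError("r_bis must be non-zero with absolute value below 1")
    a = r_bis / sqrt(1 - r_bis ** 2)
    gamma = -NormalDist().inv_cdf(p)  # deviate cutting off p in the upper tail
    b = gamma / r_bis
    return a, b

# Hypothetical items: (proportion correct, biserial correlation).
for p, r_bis in [(0.50, 0.60), (0.80, 0.50), (0.30, 0.45)]:
    a, b = approximate_irt_parameters(p, r_bis)
    print(f"p={p:.2f}  r_bis={r_bis:.2f}  ->  a={a:.3f}  b={b:.3f}")
```

Under these formulas an easy item (p above 0.5) receives a negative difficulty estimate and a hard item a positive one, while higher biserials translate into steeper discrimination. The approximations lean on the assumptions of normally distributed ability and no guessing, which is presumably why their accuracy holds only under certain conditions.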

As one looks at studies of CTT v. IRT, a common citation is to words given by Robert Thorndike as he addressed an Australian Council for Educational Research conference on IRT methods (words later included in Thorndike (1982)):

For the large bulk of testing, both with locally developed and with standardized tests, I doubt that there will be a great deal of change. The items that we will select for a test will not be much different from those we would have selected with earlier procedures, and the resulting tests will continue to have much the same properties.

In a review of the 4th edition of Educational Measurement, a psychometric classic, Wainer (2006), referring to Gulliksen’s CTT-based text, wrote:

I judge that at least 80% of all psychometric demands at the Educational Testing Service could well be handled with the material in Gulliksen’s (1950/1987) classic text.

I have provided these citations in an effort to reassure Lertap 5 users who might think that they may be missing out on something if they don’t use IRT methods, or that in applying Lertap they’re using a system which is outdated. There’s clearly been a great IRT groundswell, but, at the end of the day, CTT has no need to dip its head. It’s not “modern” in that it is not a recently-developed tool, but it’s certainly still used by “modern” people, and so it should be. Given the findings from the various studies cited above, one might even be tempted to say that CTT has had some of its luster restored.

Talking about the end of the day, or, if you will, the bottom line: IRT has little immediate relevance to the everyday needs of classroom teachers and action researchers. The proper use of IRT methods involves steps having to do with model validation, and item calibration; substantial samples of test takers are usually required by the IRT calibration process. Classroom teachers concerned with using tests as part of their effort to assess student achievement, and everyday action researchers deploying straightforward affective instruments, have objectives which CTT remains eminently well suited to.

Readers will want to consult the references for more on the practical use and development of items and tests, and on the interpretation of relevant statistics. Lertap’s development has been influenced by a number of people, foremost among them Ken Hopkins. For good discussions on the use of tests, and on the interpretation of test scores and item statistics, see Hopkins (1998). Another outstanding text in this area is that by Linn and Gronlund (1995). The practical interpretation of Lertap 5’s many reports and tables is featured in Chapters 7 and 8 of the Lertap 5 manual (Nelson, 2000). The convergence of IRT and CTT, as mentioned here, is nicely addressed in the on-line research reports archive maintained by Umeå University’s Department of Educational Research (see the references for URLs to Stage (1998, 2003); when you get to this site, read at least three of the reports, EM No 33, EM No 34, and EM No 42; of these, if you have time to read only one, go for EM No 42).

References

The references cited in the comments above are listed on the References page of the Lertap website.