Out of cite
Allen, M.J. & Yen, W.M. (1979). Introduction to measurement theory. Monterey, California: Brooks/Cole.
Angoff, W.H. (1993). Perspectives on differential item functioning methodology. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Berk, R.A. (1980). A consumer's guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-350.
Berk, R.A. (1984). Selecting the index of reliability. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Berk, R.A. (2000). Ask Mister Assessment Person. In Teachers: Supply and demand in an age of rising standards. Amherst, MA: National Evaluation Systems, Inc. (Note: in May 2013, the papers in this series could be found as individual PDF files by searching the internet.)
Brennan, R.L. (1972). A generalized upper-lower discrimination index. Educational and Psychological Measurement, 32, 289-303.
Brennan, R.L. (1984). Estimating the dependability of the scores. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Brennan, R.L. & Kane, M.T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.
Brown, M.B. (1977). Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 26, 343-351.
Camilli, G. & Shepard, L.A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.
Carr, N.T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Case, S.M. & Swanson, D.B. (1998). Constructing written test questions for the basic and clinical sciences. Philadelphia: National Board of Medical Examiners. Refer to www.nbme.org/about/itemwriting.asp.
Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Cizek, G.J. (2001). An overview of issues concerning cheating on large-scale tests. Paper presented at the annual meeting of NCME, the National Council on Measurement in Education, April 2001, Seattle, Washington. PDF copy possibly available via: http://www.natd.org/Cizek%20Symposium%20Paper.PDF
Clauser, B.E. & Mazor, K.M. (1998). An NCME instructional module on using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Crocker, L.M. & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston.
Dawber, T. (2004). Robustness of Lord's formulas for item difficulty and discrimination conversions between classical and item response theory models. Edmonton, Alberta: unpublished doctoral dissertation, University of Alberta (also see the following reference).
Dawber, T., Rogers, W.T., & Carbonaro, M. (2004). Robustness of Lord's formulas for item difficulty and discrimination conversions between classical and item response theory models. Paper presented at the annual meeting of AERA, the American Educational Research Association, April 12, 2004, San Diego, California. PDF copy possibly available via: www.education.ualberta.ca/educ/psych/crame/research.htm.
de la Harpe, B.I. (1998). Design, implementation, and evaluation of an in-context learning support program for first year education students and its impact on educational outcomes. Perth, Western Australia: unpublished doctoral dissertation, Curtin University of Technology.
Dimitrov, D.M. (2003). Reliability and true-score measures of binary items as a function of their Rasch difficulty parameter. Journal of Applied Measurement, 4(3), 222-233.
Dorans, N.J. & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Dorans, N.J. & Kulick, E. (2006). Differential item functioning on the Mini-Mental State Examination: an application of the Mantel-Haenszel and standardization procedures. Medical Care, 44(11), S107-S114.
Du Toit, M. (Ed.) (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Eason, S. (1991). Why generalizability theory yields better results than classical test theory: a primer with concrete examples. In B. Thompson (Ed.), Advances in Educational Research: Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI.
Ebel, R.L. & Frisbie, D.A. (1986). Essentials of educational measurement (4th ed.). Sydney: Prentice-Hall of Australia.
Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381.
Feldt, L.S. (1984). Some relationships between the binomial error model and classical test theory. Educational and Psychological Measurement, 44, 883-891.
Frederiksen, N., Mislevy, R.J., & Bejar, I.I. (Eds.) (1993). Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Garrett, H.E. (1952). Testing for teachers. New York: American Book Company.
Gil Escudero, G., Suárez Falcón, J.C., & Martinez Arias, R. (1999). Aplicación de un procedimiento iterativo para la selección de modelos de la Teoria de la Respuesta al Item en una prueba de rendimiento lector [Application of an iterative procedure for selecting item response theory models in a reading achievement test]. Revista de Educación, 319, 253-272.
Glass, G.V & Stanley, J.C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
Glass, G.V & Stanley, J.C. (1974). Métodos estadísticos aplicados a las ciencias sociales [Statistical methods applied to the social sciences]. London: Prentice-Hall Internacional.
Green, J. (1999). Excel 2000 VBA programmer's reference. Birmingham, England: Wrox Press.
Gronlund, N.E. (1985). Measurement and evaluation in teaching (5th ed.). New York: Collier Macmillan Publishers.
Gulliksen, H. (1950). Theory of mental test scores. New York: John Wiley & Sons.
Haladyna, T.M. (2004). Developing and validating multiple-choice test items (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Haladyna, T.M. & Rodriguez, M.C. (2013). Developing and validating test items. New York: Routledge.
Hambleton, R.K. & Jones, R.W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.
Hambleton, R.K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications.
Harpp, D.N. & Hogan, J.J. (1993). Crime in the classroom: detection and prevention of cheating on multiple-choice exams. Journal of Chemical Education, 70(4), 306-311.
Harpp, D.N., Hogan, J.J., & Jennings, J.S. (1996). Crime in the classroom: Part II, an update. Journal of Chemical Education, 73(4), 349-351.
Hattie, J., Jaeger, R.M., & Bond, L. (1999). Persistent methodological questions in educational testing. Review of Research in Education, 24, 393-446.
Hays, W.L. (1973). Statistics for the social sciences. London: Holt, Rinehart and Winston.
Hills, J.R. (1976). Measurement and evaluation in the classroom. Columbus, Ohio: Charles E. Merrill.
Hopkins, K.D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.
Hopkins, K.D. & Glass, G.V (1978). Basic statistics for the behavioral sciences. Englewood Cliffs, NJ: Prentice-Hall.
Hopkins, K.D., Stanley, J.C., & Hopkins, B.R. (1990). Educational and psychological measurement and evaluation (7th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hoyt, C.J. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
Kaplan, R.M. & Saccuzzo, D.P. (1993). Psychological testing: principles, applications, and issues. Pacific Grove, California: Brooks/Cole.
Kelley, T.L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24.
Kerlinger, F.N. (1973). Foundations of behavioral research (2nd ed.). London: Holt, Rinehart, and Winston.
Kolen, M.J. & Brennan, R.L. (1995). Test equating: methods and practices. New York: Springer-Verlag.
Lawson, S. (1991). One parameter latent trait measurement: Do the results justify the effort? In B. Thompson (Ed.), Advances in Educational Research: Substantive findings, methodological developments (Vol. 1, pp. 159-168). Greenwich, CT: JAI.
Lindeman, R.H. & Merenda, P.F. (1979). Educational measurement (2nd ed.). London: Scott, Foresman and Company.
Linn, R.L. & Gronlund, N.E. (1995). Measurement and assessment in teaching (7th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F.M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21(3), 239-243.
Lord, F.M. & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley.
MacDonald, P. & Paunonen, S.V. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62(6), 921-943.
Magnusson, D. (1967). Test theory. London: Addison-Wesley.
Mehrens, W.A. & Lehmann, I.J. (1991). Measurement and evaluation in education and psychology (4th ed.). London: Holt, Rinehart and Winston.
Michaelides, M.P. (2008). An illustration of a Mantel-Haenszel procedure to flag misbehaving common items in test equating. Practical Assessment, Research & Evaluation, 13(7). Available online: http://pareonline.net/getvn.asp?v=13&n=7
Nandakumar, R. (1994). Assessing dimensionality of a set of item responses—Comparison of different approaches. Journal of Educational Measurement, 31, 17-35.
Nelson, L.R. (1974). Guide to LERTAP use and interpretation. Dunedin, New Zealand: Department of Education, University of Otago.
Nelson, L.R. (1981). PLATISLA, an introduction to applied social science statistical methods. Dunedin, New Zealand: Department of Education, University of Otago.
Nelson, L.R. (1984). Using microcomputers to assess achievement and instruction. Educational Measurement: Issues and Practice, 3(2), 22-26.
Nelson, L.R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology (www.lertap.curtin.edu.au).
Nelson, L.R. (2004). Excel as an aide in teaching measurement and research methods. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 2(1), 43-55.
Nelson, L.R. (2005). Some observations on the scree test, and on coefficient alpha. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 3(1), 1-17.
Nelson, L.R. (2006). Using selected indices to monitor cheating on multiple-choice exams. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 4(1), 1-18. (This paper was later updated substantially.)
Nelson, L.R. (2007). Some issues related to the use of cut scores. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 5(1), 1-16.
Online Press, Inc. (1997). Quick course in Microsoft Excel 97. Redmond, Washington: Microsoft Press.
Oosterhof, A.C. (1990). Classroom applications of educational measurement. Columbus, Ohio: Merrill.
Pedhazur, E.J. & Schmelkin, L.P. (1991). Measurement, design, and analysis: an integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Peng, C-Y.J. & Subkoviak, M.J. (1980). A note on Huynh's normal approximation procedure for estimating criterion-referenced reliability. Journal of Educational Measurement, 17, 359-368.
Pintrich, P.R., Smith, D.A.F., Garcia, T. & McKeachie, W.J. (1991). A manual for the use of the Motivated Strategies for Learning Questionnaire (MSLQ). Ann Arbor, Michigan: The University of Michigan.
Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Qualls-Payne, A.L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29(3), 213-225.
Roussos, L.A., Schnipke, D.L., & Pashley, P.J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24(3), 293-322.
Sanders, D.H. (1981). Computers in society. New York: McGraw-Hill.
Stage, C. (1998). A comparison between item analysis based on item response theory and classical test theory: a study of the SweSAT Subtest READ. Educational Measurement No 30. Umeå, Sweden: University of Umeå, Department of Educational Measurement. (Possibly available at www.umu.se/edmeas/publikationer/index_eng.html.)
Stage, C. (2003). Classical test theory or item response theory: the Swedish experience. Educational Measurement No 42. Umeå, Sweden: University of Umeå, Department of Educational Measurement. (Possibly available at www.umu.se/edmeas/publikationer/index_eng.html; found at the following address in January 2008: www.umu.se/edmeas/publikationer/pdf/em%20no%2042.pdf.)
Stevenson, J. (1998). Performance of the Cognitive Holding Power Questionnaire in schools. Learning and Instruction, 8(5), 393-410.
Stevenson, J.C. & Evans, G.T. (1994). Conceptualization and measurement of cognitive holding power. Journal of Educational Measurement, 31(2), 161-181.
Subkoviak, M.J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265-276.
Subkoviak, M.J. (1984). Estimating the reliability of mastery-nonmastery classifications. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: understanding concepts and applications. Washington, DC: The American Psychological Association.
Thompson, B. (2006). Foundations of behavioral statistics: an insight-based approach. New York: The Guilford Press.
Thorndike, R.L. (1982). Educational measurement: Theory and practice. In D. Spearitt (Ed.), The improvement of measurement in education and psychology: Contributions of latent trait theory (pp. 3-13). Princeton, NJ: ERIC Clearinghouse of Tests, Measurements, and Evaluations. (ERIC Document Reproduction Service No. ED 222 545.)
Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement, 26, 191-208.
Wesolowsky, G.O. (2000). Detecting excessive similarity in answers on multiple choice exams. Journal of Applied Statistics, 27(7), 909-921.
Wiersma, W. & Jurs, S.G. (1990). Educational measurement and testing (2nd ed.). Boston: Allyn & Bacon.
Zieky, M. (2003). A DIF Primer. Princeton, NJ: Educational Testing Service. See: http://www.ets.org/Media/Tests/PRAXIS/pdf/DIF_primer.pdf