Out of cite
Allen, M.J. & Yen, W.M. (1979). Introduction to measurement theory. Monterey, California: Brooks/Cole.
Angoff, W.H. (1993). Perspectives on differential item functioning methodology. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning (pp. 3-23). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Berk, R.A. (1980). A consumer's guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-350.
Berk, R.A. (1984). Selecting the index of reliability. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Berk, R.A. (2000). Ask Mister Assessment Person. In Teachers: Supply and demand in an age of rising standards. Amherst, MA: National Evaluation Systems, Inc. (Note: in May 2013, the papers in this series could be found as individual PDF files by searching the internet.)
Brennan, R.L. (1972). A generalized upper-lower discrimination index. Educational and Psychological Measurement, 32, 289-303.
Brennan, R.L. (1984). Estimating the dependability of the scores. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Brennan, R.L. & Kane, M.T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.
Brown, M.B. (1977). Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 26, 343-351.
Camilli, G. & Shepard, L.A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.
Carr, N.T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Case, S.M. & Swanson, D.B. (1998). Constructing written test questions for the basic and clinical sciences. Philadelphia: National Board of Medical Examiners. Refer to www.nbme.org/about/itemwriting.asp.
Cattell, R.B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Cizek, G.J. (2001). An overview of issues concerning cheating on large-scale tests. Paper presented at the annual meeting of NCME, the National Council on Measurement in Education, April 2001, Seattle, Washington. PDF copy possibly available via: http://www.natd.org/Cizek%20Symposium%20Paper.PDF
Clauser, B.E. & Mazor, K.M. (1998). An NCME instructional module on using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Crocker, L.M. & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston.
Dawber, T. (2004). Robustness of Lord's formulas for item difficulty and discrimination conversions between classical and item response theory models. Edmonton, Alberta: unpublished doctoral dissertation, University of Alberta (also see the following reference).
Dawber, T., Rogers, W.T., & Carbonaro, M. (2004). Robustness of Lord's formulas for item difficulty and discrimination conversions between classical and item response theory models. Paper presented at the annual meeting of AERA, the American Educational Research Association, April 12, 2004, San Diego, California. PDF copy possibly available via: www.education.ualberta.ca/educ/psych/crame/research.htm.
de la Harpe, B.I. (1998). Design, implementation, and evaluation of an in-context learning support program for first year education students and its impact on educational outcomes. Perth, Western Australia: unpublished doctoral dissertation, Curtin University of Technology.
Dimitrov, D.M. (2003). Reliability and true-score measures of binary items as a function of their Rasch difficulty parameter. Journal of Applied Measurement, 4(3), 222-233.
Dorans, N.J. & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland and H. Wainer (Eds.), Differential Item Functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Dorans, N.J. & Kulick, E. (2006). Differential item functioning on the Mini-Mental State Examination: an application of the Mantel-Haenszel and standardization procedures. Medical Care, 44(11), S107-S114.
Du Toit, M. (Ed.) (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Eason, S. (1991). Why generalizability theory yields better results than classical test theory: a primer with concrete examples. In B. Thompson (Ed.), Advances in Educational Research: Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI.
Ebel, R.L. & Frisbie, D.A. (1986). Essentials of educational measurement (4th ed.). Sydney: Prentice-Hall of Australia.
Fan, X. (1998). Item response theory and classical test theory: an empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357-381.
Feldt, L.S. (1984). Some relationships between the binomial error model and classical test theory. Educational and Psychological Measurement, 44, 883-891.
Frederiksen, N., Mislevy, R.J., & Bejar, I.I. (Eds.) (1993). Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Garrett, H.E. (1952). Testing for teachers. New York: American Book Company.
Gil Escudero, G., Suárez Falcón, J.C., & Martinez Arias, R. (1999). Aplicación de un procedimiento iterativo para la selección de modelos de la Teoria de la Respuesta al Item en una prueba de rendimiento lector [Application of an iterative procedure for selecting item response theory models in a reading achievement test]. Revista de Educación, 319, 253-272.
Glass, G.V & Stanley, J.C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
Glass, G.V & Stanley, J.C. (1974). Métodos estadísticos aplicados a las ciencias sociales [Statistical methods applied to the social sciences]. London: Prentice-Hall Internacional.
Green, J. (1999). Excel 2000 VBA programmer's reference. Birmingham, England: Wrox Press.
Gronlund, N.E. (1985). Measurement and evaluation in teaching (5th ed.). New York: Collier Macmillan Publishers.
Gulliksen, H. (1950). Theory of mental test scores. New York: John Wiley & Sons.
Haladyna, T.M. (2004). Developing and validating multiple-choice test items (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Haladyna, T.M. & Rodriguez, M.C. (2013). Developing and validating test items. New York: Routledge.
Hambleton, R.K. & Jones, R.W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.
Hambleton, R.K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications.
Harpp, D.N. & Hogan, J.J. (1993). Crime in the classroom: detection and prevention of cheating on multiple-choice exams. Journal of Chemical Education, 70(4), 306-311.
Harpp, D.N., Hogan, J.J., & Jennings, J.S. (1996). Crime in the classroom: Part II, an update. Journal of Chemical Education, 73(4), 349-351.
Hattie, J., Jaeger, R.M., & Bond, L. (1999). Persistent methodological questions in educational testing. Review of Research in Education, 24, 393-446.
Hays, W.L. (1973). Statistics for the social sciences. London: Holt, Rinehart and Winston.
Hills, J.R. (1976). Measurement and evaluation in the classroom. Columbus, Ohio: Charles E. Merrill.
Hopkins, K.D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.
Hopkins, K.D. & Glass, G.V (1978). Basic statistics for the behavioral sciences. Englewood Cliffs, NJ: Prentice-Hall.
Hopkins, K.D., Stanley, J.C., & Hopkins, B.R. (1990). Educational and psychological measurement and evaluation (7th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hoyt, C.J. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
Kaplan, R.M. & Saccuzzo, D.P. (1993). Psychological testing: principles, applications, and issues. Pacific Grove, California: Brooks/Cole.
Kelley, T.L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24.
Kerlinger, F.N. (1973). Foundations of behavioral research (2nd ed.). London: Holt, Rinehart, and Winston.
Kolen, M.J. & Brennan, R.L. (1995). Test equating: methods and practices. New York: Springer-Verlag.
Lawson, S. (1991). One parameter latent trait measurement: Do the results justify the effort? In B. Thompson (Ed.), Advances in Educational Research: Substantive findings, methodological developments (Vol. 1, pp. 159-168). Greenwich, CT: JAI.
Lindeman, R.H. & Merenda, P.F. (1979). Educational measurement (2nd ed.). London: Scott, Foresman and Company.
Linn, R.L. & Gronlund, N.E. (1995). Measurement and assessment in teaching (7th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F.M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21(3), 239-243.
Lord, F.M. & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley.
MacDonald, P. & Paunonen, S.V. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62(6), 921-943.
Magnusson, D. (1967). Test theory. London: Addison-Wesley.
Mehrens, W.A. & Lehmann, I.J. (1991). Measurement and evaluation in education and psychology (4th ed.). London: Holt, Rinehart and Winston.
Michaelides, M.P. (2008). An illustration of a Mantel-Haenszel procedure to flag misbehaving common items in test equating. Practical Assessment, Research & Evaluation, 13(7). Available online: http://pareonline.net/getvn.asp?v=13&n=7
Nandakumar, R. (1994). Assessing dimensionality of a set of item responses—Comparison of different approaches. Journal of Educational Measurement, 31, 17-35.
Nelson, L.R. (1974). Guide to LERTAP use and interpretation. Dunedin, New Zealand: Department of Education, University of Otago.
Nelson, L.R. (1981). PLATISLA, an introduction to applied social science statistical methods. Dunedin, New Zealand: Department of Education, University of Otago.
Nelson, L.R. (1984). Using microcomputers to assess achievement and instruction. Educational Measurement: Issues and Practice, 3(2), 22-26.
Nelson, L.R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology (www.lertap.curtin.edu.au).
Nelson, L.R. (2004). Excel as an aide in teaching measurement and research methods. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 2(1), 43-55.
Nelson, L.R. (2005). Some observations on the scree test, and on coefficient alpha. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 3(1), 1-17.
Nelson, L.R. (2006). Using selected indices to monitor cheating on multiple-choice exams. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 4(1), 1-18. (This paper was later updated substantially.)
Nelson, L.R. (2007). Some issues related to the use of cut scores. Thai Journal of Educational Research and Measurement (ISSN 1685-6740), 5(1), 1-16.
Online Press, Inc. (1997). Quick course in Microsoft Excel 97. Redmond, Washington: Microsoft Press.
Oosterhof, A.C. (1990). Classroom applications of educational measurement. Columbus, Ohio: Merrill.
Pedhazur, E.J. & Schmelkin, L.P. (1991). Measurement, design, and analysis: an integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Peng, C-Y.J. & Subkoviak, M.J. (1980). A note on Huynh's normal approximation procedure for estimating criterion-referenced reliability. Journal of Educational Measurement, 17, 359-368.
Pintrich, P.R., Smith, D.A.F., Garcia, T. & McKeachie, W.J. (1991). A manual for the use of the Motivated Strategies for Learning Questionnaire (MSLQ). Ann Arbor, Michigan: The University of Michigan.
Popham, W.J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Qualls-Payne, A.L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29(3), 213-225.
Roussos, L.A., Schnipke, D.L., & Pashley, P.J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24(3), 293-322.
Sanders, D.H. (1981). Computers in society. New York: McGraw-Hill.
Stage, C. (1998). A comparison between item analysis based on item response theory and classical test theory: a study of the SweSAT Subtest READ. Educational Measurement No 30. Umeå, Sweden: University of Umeå, Department of Educational Measurement. (Possibly available at www.umu.se/edmeas/publikationer/index_eng.html.)
Stage, C. (2003). Classical test theory or item response theory: the Swedish experience. Educational Measurement No 42. Umeå, Sweden: University of Umeå, Department of Educational Measurement. (Possibly available at www.umu.se/edmeas/publikationer/index_eng.html; found at the following address in January 2008: www.umu.se/edmeas/publikationer/pdf/em%20no%2042.pdf.)
Stevenson, J. (1998). Performance of the Cognitive Holding Power Questionnaire in schools. Learning and Instruction, 8(5), 393-410.
Stevenson, J.C. & Evans, G.T. (1994). Conceptualization and measurement of cognitive holding power. Journal of Educational Measurement, 31(2), 161-181.
Subkoviak, M.J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265-276.
Subkoviak, M.J. (1984). Estimating the reliability of mastery-nonmastery classifications. In R.A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore, Maryland: The Johns Hopkins Press.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: understanding concepts and applications. Washington, DC: The American Psychological Association.
Thompson, B. (2006). Foundations of behavioral statistics: an insight-based approach. New York: The Guilford Press.
Thorndike, R.L. (1982). Educational measurement: Theory and practice. In D. Spearitt (Ed.), The improvement of measurement in education and psychology: Contributions of latent trait theory (pp. 3-13). Princeton, NJ: ERIC Clearinghouse of Tests, Measurements, and Evaluations. (ERIC Document Reproduction Service No. ED 222 545.)
Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wainer, H. (1989). The future of item analysis. Journal of Educational Measurement, 26, 191-208.
Wesolowsky, G.O. (2000). Detecting excessive similarity in answers on multiple choice exams. Journal of Applied Statistics, 27(7), 909-921.
Wiersma, W. & Jurs, S.G. (1990). Educational measurement and testing (2nd ed.). Boston: Allyn & Bacon.
Zieky, M. (2003). A DIF Primer. Princeton, NJ: Educational Testing Service. See: http://www.ets.org/Media/Tests/PRAXIS/pdf/DIF_primer.pdf