Well here we are, at last. As I may have mentioned, I always check on the quality of the data I have on hand before getting into the various statistical and graphical summaries which Lertap and Excel produce.
I started out my data quality snooping in the original Data worksheet with 3,393 students. I found four students without a valid Gender code, and eliminated them, leaving 3,389 students.
I then noticed that some of the students appeared to have left many questions unanswered. I created a "9s score", a count of the number of questions a student did not answer. Almost 500 students turned out to have left twelve or more items unanswered, half the items on the test. I eliminated them (well no, not the students themselves, just their data records), leaving a final Data sheet with 2,904 students.
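All of this cleaning was done in Excel. For readers who keep their response data outside Excel, here is a minimal sketch of the same two steps in Python with pandas. The file name, the Gender codes, and the item column names (I1 to I24) are assumptions made only for illustration, and unanswered items are assumed to show up as blank cells.

```python
import pandas as pd

# Hypothetical file and column names; unanswered items are assumed to be
# blank (read in as NaN) in the 24 item-response columns I1..I24.
data = pd.read_csv("responses.csv")
item_cols = [f"I{i}" for i in range(1, 25)]

# Step 1: drop records without a valid Gender code
# ("M"/"F" are assumed codes for this example).
data = data[data["Gender"].isin(["M", "F"])]

# Step 2: the "9s score" -- how many of the 24 items each student left unanswered.
data["nines"] = data[item_cols].isna().sum(axis=1)

# Keep only students with fewer than twelve unanswered items.
clean = data[data["nines"] < 12]
print(len(clean), "students retained")
```

The end result is the same as the Excel process described above: a data set holding only students with a valid Gender code and fewer than twelve unanswered items.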
Now it is true that before arriving at this topic I went ahead and used the scores from this 24-item test in a couple of ways. For example, I looked for grade-level differences, and for age differences. I really shouldn't have done this before checking on the quality of the test itself. I'll turn to that now.
As I do, I will assume that the test is one meant to discriminate, that is, one used to identify the strongest students and separate them from the weaker ones. Such tests are commonly used as part of the process of assigning an achievement descriptor to students, an indicator of how well they have done, such as "excellent", "good", "adequate", and "poor" (or, perhaps, A, B, C, and D grades). (Another common type of test uses some sort of cut-off score to classify students as "masters"/"non-masters", or "pass"/"fail". Lertap has special tools for looking at such tests; they are described elsewhere.)
Tests meant to discriminate should have good reliability, and for a test to have good reliability its items must themselves discriminate well; their discrimination "index" should be high.
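Lertap produces these statistics itself; the sketch below, in Python, is only meant to show what lies behind them. It uses one common definition of an item's discrimination index, the corrected item-total correlation, and coefficient alpha as the reliability estimate, assuming the item responses have already been scored 0/1.

```python
import numpy as np

def item_discrimination_and_alpha(scores):
    """scores: 2-D array, rows = students, columns = items, values 0/1."""
    scores = np.asarray(scores, dtype=float)
    n_students, n_items = scores.shape
    total = scores.sum(axis=1)

    # Corrected item-total correlation: correlate each item with the total
    # score computed from the *other* items, so the item is not correlated
    # with itself.
    disc = np.empty(n_items)
    for i in range(n_items):
        rest = total - scores[:, i]
        disc[i] = np.corrcoef(scores[:, i], rest)[0, 1]

    # Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total)
    item_var = scores.var(axis=0, ddof=1)
    total_var = total.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var.sum() / total_var)
    return disc, alpha
```

Items whose discrimination values sit near zero contribute little to the total-score variance, which is why weak items tend to hold a test's reliability down.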
The following topics will look at item discrimination and test reliability, using both tables and graphs. As you'll see, a couple of the test's items could have performed better, and, had they done so, the test's reliability would have been a bit higher.
Another look at reliability is next.