Understanding Your Exam Analysis Report
The score report (see example) is an important tool that will help you evaluate the effectiveness of a test and of the individual questions that comprise it. The evaluation process, called item analysis, can improve future test and item construction. The analysis provides valuable information that helps instructors determine:
- Which are the “best” test questions to secure and continue to use on future course assessments
- Which items need review and potential revision before a next administration
- Which are the poorest items that should be eliminated from scoring on the current administration
Using the score report can also point instructors toward content that may require clarification or additional instruction. The following describes what the numbers mean and how to use them.
The information below should be used with caution. The indices described are interrelated and must be interpreted in context. If you have questions about interpretation, please contact the Schreyer Institute for Teaching Excellence (SITE). SITE also offers periodic workshops to assist instructors in conducting item analyses.
TEST SCORE RELIABILITY
Score reliability is an indication of the extent to which your test measures a single topic such as “knowledge of the battle of Gettysburg” or “skill in solving accounting problems.” Measures of internal consistency indicate how well the questions on the test consistently and collectively address a common topic or construct. Students who answer one question on a particular topic correctly should also respond correctly to similar questions.
Scanning Operations uses Cronbach's alpha, which provides reliability information about items scored dichotomously (i.e., correct/incorrect), such as multiple-choice items. When an incorrect response distracts students who did well on the exam, as shown by a high R value for that option, the item's PBS score (see item discrimination below) should be lower. PBS ranges from -1.0 to 1.0, with a minimum desired value greater than 0.15. If a single test is weighted heavily as part of students’ grades, reliability must be high. Low score reliability indicates that, if students took the same exam again, they might get a different score. Optimally, we would expect to see consistent scores on repeated administrations of the same test.
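To make the idea of internal consistency concrete, the following is a minimal sketch of how Cronbach's alpha can be computed from a table of correct/incorrect item scores. The function name, the data layout (one row per student, one 0/1 value per item), and the sample responses are assumptions made for illustration; the exact procedure Scanning Operations uses is not described here.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student rows of 0/1 item scores.

    Hypothetical helper for illustration only. Uses the standard formula
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
    """
    k = len(scores[0])  # number of items on the test
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up data: 5 students, 4 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

In this made-up data set, students who answer a harder item correctly also answer the easier ones correctly, so the items hang together and alpha comes out fairly high.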
ITEM DIFFICULTY
There are two numbers in the “item” column: the item number and the percent of students who answered the item correctly. A higher percentage indicates an easier item; a lower percentage indicates a more difficult item. It is good to gauge this difficulty index against what you expect. You should find a higher percentage of students correctly answering items you think should be easy and a lower percentage correctly answering items you think should be difficult.
Item difficulty is also important as you try to determine how well an item “worked” to separate students who know the content from those who do not (see item discrimination below). Certain items do not discriminate well. Very easy questions and very difficult questions, for example, are poor discriminators. That is, when most students get the answer correct, or when most answer incorrectly, it is difficult to ascertain who really knows the content.
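The difficulty index described above is simply the percent of students who answered each item correctly. A minimal sketch, assuming the same hypothetical layout of one row per student with a 0/1 score per item (the function name and data are illustrative, not part of the score report):

```python
def item_difficulty(scores):
    """Percent of students answering each item correctly.

    Hypothetical helper: `scores` is a list of per-student rows of
    0/1 item scores (1 = correct).
    """
    n_students = len(scores)
    n_items = len(scores[0])
    return [
        100 * sum(row[i] for row in scores) / n_students
        for i in range(n_items)
    ]


# Made-up data: 5 students, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

difficulty = item_difficulty(responses)
print(difficulty)  # higher percentage = easier item
```

Comparing these percentages with the difficulty you intended for each item is a quick first check before looking at discrimination.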
PBS (ITEM DISCRIMINATION)
Item discrimination indicates how well a particular item separates the students who know the test content from those who do not. Calculated as a point-biserial (PBS) correlation coefficient, item discrimination is an index of the degree to which students with high overall exam scores also got a particular item correct. Ideally, the discrimination (PBS) value should be greater than .20.
- An item with a negative PBS must be revised, as it may be an indication of an ambiguous question or a miskeyed correct response.
- A PBS of .00 results when all test takers choose the correct answer. (Recall that very easy items do not discriminate well. For some content, however, you may want assurance that all students know the answer to a particular item.)
- Items with a PBS between .00 and .20 should be examined, as further refinement may improve item performance.
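The guidelines above can be sketched in code. The example below computes a point-biserial correlation between one item's 0/1 scores and students' total scores, then applies the review thresholds listed above. Several details are assumptions: the function names are hypothetical, the total score includes the item itself, the population standard deviation is used, an item with no variance is reported as .00, and a value of exactly .20 is treated as "examine".

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Point-biserial correlation between 0/1 item scores and total scores."""
    correct = [t for x, t in zip(item, totals) if x == 1]
    incorrect = [t for x, t in zip(item, totals) if x == 0]
    sd = pstdev(totals)
    # Assumption: report .00 when everyone (or no one) answered correctly,
    # or when all total scores are identical.
    if not correct or not incorrect or sd == 0:
        return 0.0
    p = len(correct) / len(item)  # proportion answering correctly
    return (mean(correct) - mean(incorrect)) / sd * sqrt(p * (1 - p))


def review_flag(pbs):
    """Map a PBS value to the review guidance given in the text."""
    if pbs < 0:
        return "revise"   # possibly ambiguous or miskeyed
    if pbs <= 0.20:
        return "examine"  # refinement may improve the item
    return "ok"


# Made-up data: total scores for 5 students and their scores on one item.
totals = [4, 3, 2, 1, 0]
item = [1, 1, 1, 0, 0]

pbs = point_biserial(item, totals)
print(round(pbs, 2), review_flag(pbs))
```

Here the students who answered the item correctly are also the higher scorers overall, so the PBS is strongly positive. Reversing the item scores would give a negative PBS and a "revise" flag.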
Beside each item are the response options (A-J). One is the key (correct response) and the others, the distractors (plausible, but incorrect, responses). The grey shaded option for each item indicates the key.
The TTL column indicates the number of students who selected that particular option.
The R column provides two pieces of information: the test mean and the standard deviation for the set of students who chose that particular option. For items that work well, we would expect the mean to be relatively high for the correct option and relatively low for the incorrect options. Conversely, incorrect response options that register a high mean should be carefully inspected to determine why the higher scorers on the exam (i.e., the better students) selected the incorrect answer.
The second value under the “R” column is the standard deviation, or the spread of scores, around the mean indicated for students who selected each option. Again, for the correct option (i.e., the key), we would expect to see a relatively high mean and relatively low standard deviation. In other words, the people who chose the correct answer are the ones with the overall higher test scores and there is relatively little variance among them.
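The TTL and R columns can be reproduced for a single item by grouping students' total test scores by the option each student selected. The sketch below is illustrative only: the function name, dictionary keys, and data are made up, and it assumes a population standard deviation, which may differ from what the score report uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def option_summary(choices, totals):
    """Per-option count (TTL), mean total score, and SD of total scores."""
    groups = defaultdict(list)
    for choice, total in zip(choices, totals):
        groups[choice].append(total)
    return {
        option: {
            "TTL": len(scores),     # number of students choosing this option
            "mean": mean(scores),   # first value in the R column
            "sd": pstdev(scores),   # second value in the R column
        }
        for option, scores in sorted(groups.items())
    }


# Made-up data for one item whose key is "A": each student's chosen option
# and that student's total test score.
choices = ["A", "A", "B", "A", "C", "B"]
totals = [9, 8, 4, 7, 3, 5]

summary = option_summary(choices, totals)
for option, stats in summary.items():
    print(option, stats)
```

In this example the key ("A") has the highest mean, which is the pattern expected for an item that works well. A distractor with a mean close to or above that of the key would be a candidate for closer inspection.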