Potential effects of question sampling on test fairness

Tests assembled from sampling groups whose questions are not parallel have predictable consequences. If one question in a sampling group is much more or much less difficult than the others, then the tests created by sampling from these groups will vary in overall difficulty. In other words, some forms of the test will be easier and others harder. The score assigned to each student will then depend on the particular combination of questions that student answered, in addition to the student's preparation and ability. Using tests that vary in average difficulty is a serious threat to the validity of the scores. The problem can be serious even if only a few sampling groups contain questions with very different levels of difficulty.
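The effect of a single non-parallel group on form difficulty can be illustrated with a small calculation. The sketch below is only an illustration: it assumes each question has a known proportion of students who answer it correctly (a p-value), that a form draws exactly one question from each sampling group, and that form difficulty is the mean of those p-values. The function names and all numbers are made up for the example.

```python
from itertools import product


def form_difficulties(groups):
    """Return the expected proportion correct for every possible test form.

    `groups` is a list of sampling groups; each group is a list of
    hypothetical p-values (proportion of students answering correctly).
    A form draws exactly one question from each group.
    """
    return [sum(form) / len(form) for form in product(*groups)]


def difficulty_spread(groups):
    """Gap between the easiest and the hardest possible form."""
    values = form_difficulties(groups)
    return max(values) - min(values)


# Illustrative p-values only, not real data.
parallel = [[0.70, 0.70], [0.60, 0.60], [0.80, 0.80], [0.50, 0.50]]
one_bad_group = [[0.70, 0.30], [0.60, 0.60], [0.80, 0.80], [0.50, 0.50]]

print(round(difficulty_spread(parallel), 3))
print(round(difficulty_spread(one_bad_group), 3))
```

With these invented values, every form built from the parallel groups has the same expected score, while the one mismatched group is enough to separate the easiest and hardest forms by about ten percentage points on a four-question test.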

If the questions in a sampling group do not work in the same way, then students' responses to them will vary; that is, the questions will present substantially different tasks to students. This can arise from a variety of sources. For example, the distracters may trap different sets of misconceptions, and using the "none of the above" response in some questions in a sampling group but not in others can give the questions different demand characteristics. By the demand characteristics of a question, we mean the complexity of the thought processes needed to evaluate the responses and then select an answer.

If the questions in a sampling group do not all pertain to the same instructional goal or objective, then the forms of the test generated from these groups will differ in which course objectives they evaluate. Scores based on tests that differ in the objectives assessed will not generally be comparable, so the use of such tests is not recommended.
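One way to catch this problem before a test is generated is to tag each question with its objective and flag any sampling group that contains more than one tag. The sketch below assumes a hypothetical question bank stored as a dictionary of group names to (question id, objective) pairs; the names and data are illustrative, not taken from any particular testing system.

```python
def mixed_objective_groups(groups):
    """Return the sampling groups whose questions span more than one objective.

    `groups` maps a group name to a list of (question_id, objective) pairs.
    The result maps each offending group name to its sorted objectives.
    """
    mixed = {}
    for name, questions in groups.items():
        objectives = sorted({objective for _, objective in questions})
        if len(objectives) > 1:
            mixed[name] = objectives
    return mixed


# Hypothetical question bank: G2 mixes two different objectives.
groups = {
    "G1": [("q1", "unit conversion"), ("q2", "unit conversion")],
    "G2": [("q3", "unit conversion"), ("q4", "graph reading")],
}

print(mixed_objective_groups(groups))
```

A check like this only confirms that the tags agree. Whether the questions in a group really assess the same objective still requires the instructor's judgment.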

Instructors whose discipline is substantially numeric in nature can, with little effort, create groups of parallel questions in which a numeric parameter simply changes value from one question in a sampling group to another. If multiple choice questions are used, instructors will have to ensure that the distracters trap an identical set of misconceptions and/or errors. Instructors whose discipline involves mostly or entirely words will generally find it more difficult to create parallel groups of questions, simply because making language sufficiently precise is harder than changing numeric parameters. Instructors may discover that it is relatively easy to write parallel forms of questions in some topic areas and not in others. In this case it is advisable to use question sampling only in those topic areas that lend themselves to the creation of questions that have a good chance of being effective parallel questions.
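For numeric questions, one way to keep the distracters aligned is to compute each wrong answer from the same named error in every variant. The following is a minimal sketch under assumed details: the average-speed question, the three misconceptions, and the names Question and make_speed_question are hypothetical examples, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Question:
    stem: str
    correct: float
    distractors: dict  # misconception label -> resulting wrong answer


def make_speed_question(distance_km, time_h):
    """Build one variant; every variant traps the same three errors."""
    correct = distance_km / time_h
    distractors = {
        "multiplied instead of divided": distance_km * time_h,
        "inverted the ratio": time_h / distance_km,
        "subtracted instead of divided": distance_km - time_h,
    }
    options = [correct, *distractors.values()]
    # Reject parameter values for which two answer choices coincide.
    if len(set(options)) != len(options):
        raise ValueError("parameters make two options coincide")
    stem = (
        f"A car travels {distance_km} km in {time_h} hours. "
        "What is its average speed in km/h?"
    )
    return Question(stem, correct, distractors)


# Only the numeric parameters change from one question to the next.
sampling_group = [
    make_speed_question(d, t) for d, t in [(120, 4), (150, 3), (200, 5)]
]

for q in sampling_group:
    print(q.stem, q.correct, sorted(q.distractors.values()))
```

Because every variant is produced by the same function, each one offers the correct answer plus the same three errors. The parameter values still need review, since some combinations (very small numbers, for instance) can make a question noticeably easier or make two options identical.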

Experience shows that certain practices do not produce parallel groups of questions. One that often occurs to inexperienced question writers is to create a positively stated version (identification) and a negatively worded version (exception) of the same question. This involves changing the language of one question from positive to negative by adding the word "not" to the stem and then adjusting the set of responses. The problem with this practice is that the logical demands of the two questions (one asking what is not true, the other asking what is true) differ substantially in what test takers must do to select an answer. Using "not" to quickly create questions for a parallel group will nearly always result in questions that are not parallel, because they will not be equally difficult. Instructors are advised to resist the temptation to use this approach as a way out of the hard work of crafting parallel forms of questions.