*By Susan M. McMillan, PhD*

Our 2003 paper on point-biserial correlations and *p*-values, item statistics from classical test theory, has been one of our most popular publications. This three-part blog provides a slightly revised and refreshed explanation of how educators can use item statistics to improve classroom assessments. Part 1 is a nontechnical overview of item analysis and explains why we recommend it for classroom assessments. Part 2 describes how to compute *p*-values and point-biserial correlations using an Excel spreadsheet, and Part 3 discusses the interpretation of item statistics.

## Part 1: Item Analysis Overview

Educators routinely develop and administer classroom tests throughout the academic year. These tests are perceived as low stakes because they are often diagnostic tools and are not used to measure student growth over time or for statewide reporting purposes. However, classroom tests are important because they are used to inform instruction and curriculum development and to steer teacher professional development. Further, classroom assessments provide information to students about their educational journey. For these reasons, we believe that classroom tests should be reviewed for technical quality. Classroom assessments should provide information that is meaningful and consistent, whether teachers write the questions or put together an assessment from a commercially available item bank.

### What is Item Analysis?^{1}

Assessment specialists use the word “item” to refer to multiple types of questions (multiple-choice, true/false, matching, constructed response, and performance task) on multiple types of assessments (graded assignments, quizzes, or tests). Item analysis is a method of reviewing items for quality and is both qualitative and quantitative.

*Qualitative analysis of classroom assessment items involves teachers, as content experts, checking whether the items:*

- Match the content standards being assessed
- Are written clearly to prevent confusion
- Have one correct answer (multiple-choice items)
- Provide clear expectations for written answers or performances that can be graded fairly (constructed response and performance-task items)
- Provide all students with the same chances for success (unbiased)
- Help educators answer education-related questions about their students

Qualitative reviews happen prior to administering the items to prevent student stress when faced with confusing questions and to ensure that each item contributes to the goal of collecting educationally useful information.

*Quantitative item analysis* happens after the items have been administered and scored. The student responses and item scores provide numeric data that is reviewed for clues about the quality of educational information produced by each item. Assessment experts sometimes refer to “item behavior” to mean how well an item functions as an indicator of student knowledge.

### Why Review Items for Quality?

The goal of classroom assessment is to understand what students know and can do and to use that information to make education-related decisions. If the items on assignments, quizzes, and tests do not contribute to that goal, they could be a waste of student and teacher time (at best) or counter-productive (at worst). Item analysis provides evidence that a test measures what it is supposed to measure and that it produces consistent results^{2}. Reviewing item quality helps to ensure that educators are obtaining the best possible information to make instructional decisions.

Item quality is related to the length of assessments. Poor-quality items produce information that is not directly relevant to the content or that is inconsistent about what students know. When test items are well-written and cover relevant content, educators need fewer of them to obtain consistent results. Better assessment items reduce the testing burden on both students and teachers.

One common example of how poor items complicate decision-making is that word problems on an algebra test may be measuring both algebra and English language skills. Another common example is that reading comprehension items can be measuring background knowledge in addition to reading comprehension. Educators trying to place students in algebra or reading classes would have trouble making decisions based on tests with these types of multidimensional items. The test might need to be longer to provide consistent information that answers the educational questions.

Large-scale assessment programs employ psychometricians (measurement scientists) who perform highly sophisticated technical item analysis. Classroom educators do not need that level of training to use *p*-values and point-biserial correlations, which are quantitative tools that are both available and accessible to them.

### What are *P*-Values and Point-Biserial Correlations?

*P*-values provide information about item difficulty^{3}, and point-biserial correlations, also called item-total correlations, are indicators of how the scores on each item are related to the overall test scores, often called “discrimination.”

For items that have only two score points (correct or incorrect), the *p*-value is the proportion of students who got the item correct. As an indicator of difficulty, it is reverse coded; a high value means the item was relatively easy, and a low value means it was more difficult. *P*-values range from 0.0 (no students got the item correct) to 1.0 (all students answered correctly). Interpreting *p*-values depends upon the purpose of the assessment and is covered in more detail in “Part 3: Interpreting Item Statistics.”
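As a minimal sketch of the calculation described above, the Python snippet below computes *p*-values for a small set of scored responses. Python is used here only for illustration (Part 2 works through the same statistic in Excel). The data, the `scores` variable, and the `item_p_values` function name are all invented for this example.

```python
# Hypothetical scored responses, invented for illustration.
# Rows are students, columns are items; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]


def item_p_values(score_matrix):
    """Return, for each item, the proportion of students who answered correctly."""
    n_students = len(score_matrix)
    n_items = len(score_matrix[0])
    return [
        sum(row[i] for row in score_matrix) / n_students
        for i in range(n_items)
    ]


p_values = item_p_values(scores)
for number, p in enumerate(p_values, start=1):
    print(f"Item {number}: p = {p:.2f}")
```

In this invented data set, four of the five students answered the first item correctly and only one answered the third item correctly, so the third item has the lowest *p*-value and is the most difficult of the four.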

Item difficulty for items with more than two score points (for example, a constructed-response item that is scored as 0, 1, or 2) can be indicated by the “adjusted item mean.”^{4} The adjusted item mean is the mean item score expressed as a proportion of the total score points possible on that item. While not technically a *p*-value, it has the same 0.0 to 1.0 range. Our examples show items scored as correct/incorrect, and the adjusted item mean is addressed in the footnotes.
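The adjusted item mean can be sketched the same way. The snippet below is an illustration under assumed data: the scores and the `adjusted_item_mean` function name are hypothetical, and the calculation simply follows the definition given above (mean item score divided by the maximum points possible on the item).

```python
def adjusted_item_mean(item_scores, max_points):
    """Mean item score expressed as a proportion of the points possible."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    mean_score = sum(item_scores) / len(item_scores)
    return mean_score / max_points


# Hypothetical constructed-response item scored 0, 1, or 2 for six students.
cr_scores = [2, 1, 0, 2, 1, 1]
adjusted_mean = adjusted_item_mean(cr_scores, max_points=2)
print(f"Adjusted item mean: {adjusted_mean:.2f}")

# For a correct/incorrect item (max_points=1), the result equals the p-value.
dichotomous_mean = adjusted_item_mean([1, 0, 1, 1], max_points=1)
```

Dividing by the maximum points is what keeps the result on the same 0.0 to 1.0 scale as a *p*-value, so items with different point values can be compared side by side.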

The point-biserial correlation is the correlation between the scores students receive on a given item and the total scores students receive on all other items on the test.^{5} The correlation will be high if students who get the item correct also have higher total test scores (minus the item in question), and students who get the item wrong also have lower total test scores. The correlation shows whether the item discriminates between students who know the material well and those who do not. The possible range is -1.0 to 1.0; negative values indicate a problem with the item or the answer key. Interpretation of the point-biserial correlation is discussed in “Part 3: Interpreting Item Statistics.”
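To make the "all other items" idea concrete, here is a minimal Python sketch of the corrected point-biserial correlation. It assumes items scored as correct/incorrect and uses an ordinary Pearson correlation between each item's scores and the total of the remaining items. The data and the function names `pearson` and `corrected_point_biserial` are invented for illustration, and Part 2 shows the equivalent steps in Excel.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Undefined when every student has the same score on either variable.
        return None
    return cov / sqrt(var_x * var_y)


def corrected_point_biserial(score_matrix, item_index):
    """Correlate one item's scores with the total score on all other items."""
    item = [row[item_index] for row in score_matrix]
    rest = [sum(row) - row[item_index] for row in score_matrix]
    return pearson(item, rest)


# Hypothetical data: rows are students, columns are items (1 = correct).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

r_first = corrected_point_biserial(scores, 0)  # positive in this data set
r_last = corrected_point_biserial(scores, 3)   # negative in this data set
print(f"Item 1: {r_first:.2f}, Item 4: {r_last:.2f}")
```

In this invented data set the first item has a positive correlation, while the last item has a negative one: students who answered it correctly tended to score lower on the rest of the test, which is the kind of pattern that would prompt a check of the item and its answer key.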

These two tools of classical test theory, *p*-values and point-biserial correlations, can be used to identify problematic items on classroom assessments. When teachers either revise or remove the items, the quality of information provided by the assessment improves.

##### Footnotes

1. The item analysis covered in this blog comes exclusively from classical test theory since those concepts and calculations are more accessible for educators with little formal psychometric training.

2. The technical concepts of validity and reliability as they relate to test scores are beyond the scope of this article. For a clear introduction see this article.

3. This usage should not be confused with the “*p*-value” in statistics, which refers to the probability of obtaining results at least as extreme as those observed if the null hypothesis is true. We note that more sophisticated measures of item difficulty and discrimination can be computed using item-response theory models, but classical test theory tools give reasonable estimates and are accessible to educators who may have little training in psychometrics.

4. California English Language Development Test Technical Report 2017–18 Edition, p. 64.

5. This is called the “corrected” point-biserial correlation because it does not correlate the item in question with its own score. Correlations can also be computed for items with more than two score points such as a constructed-response item with score points 0, 1, and 2, and the interpretation is similar.