
Evaluating the Quality of Standardized Assessments

October 29th, 2015

By Yvette Arañas

Selecting the right assessment to use at the class-, school-, or district-wide level can be a daunting task. Many assessments are available on the market, but not all of them measure students’ skills with equal accuracy, so it can be difficult to decide which ones are most appropriate. Here are a few things to consider when evaluating the quality of an assessment.

Reliability
When evaluating the quality of a standardized assessment, we usually look at its technical manual. One piece of information the manual should give is the assessment’s evidence of reliability, or its consistency in measuring students’ performance. In the manual, reliability is typically reported as correlation coefficients. According to Salvia, Ysseldyke, and Bolt (2013), screening assessments should have correlations of at least .80; assessments used for progress monitoring should have at least .70; and those used to make high-stakes decisions (e.g., special education eligibility determination) should have at least .90.
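
For readers who like to see the arithmetic, here is a minimal Python sketch that checks a reported reliability coefficient against these rule-of-thumb minimums; the purpose labels and coefficients are illustrative, not taken from any manual.

RELIABILITY_MINIMUMS = {
    "screening": 0.80,              # rule-of-thumb minimums from Salvia, Ysseldyke, & Bolt (2013)
    "progress monitoring": 0.70,
    "high-stakes decisions": 0.90,
}

def meets_minimum(coefficient, purpose):
    # True if the reported coefficient meets the minimum for the intended use.
    return coefficient >= RELIABILITY_MINIMUMS[purpose]

print(meets_minimum(0.85, "screening"))              # True
print(meets_minimum(0.85, "high-stakes decisions"))  # False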

One way to look at reliability is to see how consistently an assessment measures the same students’ performance across time. This is referred to as test-retest reliability, and it is especially important for characteristics that stay fairly stable over time. The technical manual should report the correlation between scores from one time point and scores from a later time point.
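
As a rough illustration, the sketch below uses Python with SciPy and made-up scores for eight students to compute a test-retest correlation; none of the numbers come from a real assessment.

from scipy.stats import pearsonr

fall_scores   = [42, 55, 61, 48, 70, 66, 53, 59]  # hypothetical fall administration
winter_scores = [45, 53, 64, 50, 72, 63, 55, 62]  # hypothetical winter administration of the same test

r, _ = pearsonr(fall_scores, winter_scores)
print(f"Test-retest reliability: r = {r:.2f}")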

We also want to make sure that the questions on an assessment are internally consistent, or measuring the same thing. To show evidence of this, test developers often report the correlation between responses to one half of the test and responses to the other half, known as split-half reliability. If an assessment has multiple forms, scores should also be highly correlated across all the forms (alternate-form reliability).
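
The sketch below shows the split-half idea with the Spearman-Brown correction, using NumPy and invented item responses; a real manual may instead report a statistic such as coefficient alpha.

import numpy as np

# Rows are students, columns are items (1 = correct, 0 = incorrect); the data are invented.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
])

odd_half  = items[:, 0::2].sum(axis=1)  # each student's score on odd-numbered items
even_half = items[:, 1::2].sum(axis=1)  # each student's score on even-numbered items

r_halves = np.corrcoef(odd_half, even_half)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)  # Spearman-Brown correction to full-test length
print(f"Split-half reliability: {split_half:.2f}")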

When more than one rater is involved in administering or scoring the test, the manual should provide either the correlation between raters’ scores or their percentage of agreement. This is called inter-rater reliability, and it is especially important for assessments that require raters to observe student behaviors.
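
Percentage of agreement can be computed as simply as in the sketch below, where the two raters’ observation codes are invented for illustration.

rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]  # hypothetical codes
rater_b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = 100 * agreements / len(rater_a)
print(f"Inter-rater agreement: {percent_agreement:.0f}%")  # 83%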

Validity
In addition to examining reliability, it is also important to examine evidence of validity, or the extent to which an assessment measures what it intends to measure.

One way to show evidence of validity is to provide correlations between performance on the assessment of interest and performance on another, well-established measure. This is referred to as criterion-related validity. If your assessment is intended to measure vocabulary, its scores should be highly correlated with scores from a well-established vocabulary assessment. If you want to predict something in the future, such as college academic achievement, scores on the assessment should be highly correlated with scores on an assessment given in college. As a rule of thumb, correlations with similar assessments should be at least .80.
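
The sketch below, again with made-up scores, shows how such a criterion correlation might be computed and compared with the .80 rule of thumb.

import numpy as np

new_vocab_test   = [12, 18, 25, 30, 22, 15, 28, 20]  # hypothetical scores on the new assessment
established_test = [14, 20, 27, 33, 21, 13, 30, 19]  # hypothetical scores on the criterion measure

r = np.corrcoef(new_vocab_test, established_test)[0, 1]
print(f"Criterion-related validity: r = {r:.2f} (rule of thumb: at least .80)")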

Another piece of evidence for validity to consider is content validity, or how well the test content represents the skill it is intended to measure. For example, an assessment that measures math fact fluency for grades 1 and 2 should include one-digit addition and subtraction problems, and it should be a timed task because fluency needs to be accounted for. Certain types of questions should also be excluded to maintain content validity; in the case of a math fact fluency assessment, multi-step problems that take a long time to complete should be left out.

Norms
Sometimes, we want to know how a student’s performance on a standardized assessment compares to that of a normative group, or a sample of students who have already taken the same assessment. This type of assessment should provide norms, or the distribution of scores from the normative sample. Norms allow us to reference a student’s score in relation to the scores of other test-takers, and many assessments support comparisons to students nationwide or at the local level.
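
As an illustration, the Python sketch below uses SciPy and an invented normative distribution to convert a raw score into a percentile rank; in practice, the conversions come from the norm tables in the technical manual.

from scipy.stats import percentileofscore

norm_group_scores = [35, 41, 44, 47, 50, 52, 55, 57, 60, 63, 66, 70]  # invented norm-group scores
student_score = 58

percentile_rank = percentileofscore(norm_group_scores, student_score)
print(f"Percentile rank: {percentile_rank:.0f}")  # scored as well as or better than about 67% of the norm group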

Before comparing a student’s score to the normative group, first make sure that the normative group represents your student well. For example, a comparison would not be appropriate if the student is a preschooler and the normative sample consists only of students in upper grade levels. The assessment’s technical manual should provide the normative sample’s demographics (e.g., race, ethnicity, gender, age, socio-economic status, geography, disability categories).

Second, norms should be recent. Average scores tend to change over time. Because of this, we recommend using the most recent norms and checking how long ago information was collected from the normative group. For achievement tests, norms should be no more than seven years old (Salvia, Ysseldyke, & Bolt, 2013).

Although evaluating assessments can be overwhelming, we hope that the guidelines provided above will help you select the best possible assessments to use with your students.

For more information about reliability and validity of the FAST assessments, click to request the FastBridge Learning technical manual.

And, be sure to revisit our previous blog post on Assessment Literacy for more information.

References
Salvia, J., Ysseldyke, J., & Bolt, S. (2013). Assessment in special and inclusive education. Cengage Learning.

Yvette Arañas is a doctoral student at the University of Minnesota. She was a part of FastBridge Learning’s research team for four years and contributed to developing the FAST™ reading assessments. Yvette is currently completing an internship in school psychology at a rural district in Minnesota.
