Evaluating the Content and Quality of Next Generation Assessments

We at the Thomas B. Fordham Institute have been evaluating the quality of state academic standards for nearly 20 years. For most of those two decades, we’ve also dreamed of evaluating the tests linked to those standards. They’re what schools (and sometimes teachers and students) are held accountable to and they tend to drive actual curricula and instruction. We wanted to know how well-aligned the assessments were to the standards, whether they were of high quality, and what type of cognitive demands they placed on students.
Fortunately, we had the opportunity to study those questions nearly two years ago. We released our results in early February. Our report provides an in-depth appraisal of the content and quality of three “next generation” assessments—ACT Aspire, PARCC, and Smarter Balanced—and one best-in-class state test, the Massachusetts Comprehensive Assessment System (MCAS, 2014). In total, more than 13 million children (about 40 percent of the country’s students in grades 3–11) took one of these four tests in spring 2015.

To conduct the study, we partnered with assessment experts Nancy Doorey and Morgan Polikoff and recruited nearly 40 other educators and experts in content and assessment. The testing programs graciously granted our reviewers secure access to their “live” operational items and test forms for grades 5 and 8 (the elementary and middle school capstone grades that are this study’s focus).

Our panels used a new methodology developed by the Center for Assessment to evaluate the four tests—a methodology that was itself based on the Council of Chief State School Officers’ (CCSSO) 2014 “Criteria for Procuring and Evaluating High-Quality Assessments.” The CCSSO criteria address the content and depth of state tests in both English language arts (ELA) and mathematics.

The panels evaluated the extent of the match between the assessment and a key element of the CCSSO document. They assigned one of four “match” ratings to each ELA and math-specific criterion: Excellent, Good, Limited/Uneven, or Weak Match.

What did they find?

PARCC and Smarter Balanced assessments earned an excellent or good match to the subject-area CCSSO Criteria for both ELA/literacy and mathematics. This was the case with both content and depth.

ACT Aspire and MCAS, like the other two tests, also did well regarding the quality of their items and the depth of knowledge assessed (both of which are part of the depth rating). But the panelists found that ACT Aspire and MCAS did not adequately assess, or in some cases did not really assess at all, some of the priority content in both ELA/literacy and mathematics at one or both grade levels in the study.

What do we make of these bottom-line results? Simply put, developing a test—like all major decisions in life—is full of trade-offs. PARCC and Smarter Balanced are a better match to the CCSSO Criteria, which is not surprising, given that they were both developed with the Common Core State Standards (CCSS) in mind. ACT Aspire and MCAS, on the other hand, were not developed for that explicit purpose.

Although PARCC and Smarter Balanced are a better match to the criteria, they also take longer to administer and can be more expensive for some states. The longer testing times are due primarily to the inclusion of extended performance tasks. Both programs use these tasks to assess high-priority skills within the CCSS, such as developing written compositions in which a claim is supported with evidence and solving complex multi-step problems in mathematics. These tasks are also typically costlier to develop and score.

Another trade-off pertains to inter-state comparability. Some states want the autonomy that comes with having their own state test developed by their own educators. Other states prioritize the ability to compare their students with those in other states via a multi-state test. We think the extra time and money, plus the comparability advantage, are trade-offs worth making, but we can’t pretend that they’re not tough decisions in a time of tight budgets and widespread anxiety about testing burden.

We suspect that much of the real concern has to do with the pressure that educators feel to teach to the test and narrow the curriculum. If we’re right, the answer is stronger tests, which encourage better, broader, richer instruction, and which make traditional “test prep” ineffective.

Our point is not to advocate for any particular tests, but to root for those that have qualities that enhance, rather than constrict, a child’s education and give her the best opportunity to show what she’s learned.

A discussion of such qualities, and of the types of trade-offs involved in obtaining them, is precisely the kind of conversation that merits honest debate in states and districts.

To learn more, read the Fordham Institute’s report: Evaluating the Content and Quality of Next Generation Assessments
