How Standardized Tests Are Scored (Hint: Humans Are Involved) | KERA News


Jul 13, 2015
Originally published on July 9, 2015 9:58 am

Standardized tests tied to the Common Core are under fire in lots of places for lots of reasons. But who makes these tests, and how they're scored, remains a mystery to most.

For a peek behind the curtain, I traveled to the home of the nation's largest test-scoring facility: San Antonio.

The facility is one of Pearson's — the British-owned company that dominates the testing industry in the U.S. and is one of the largest publishing houses behind these mysterious standardized tests.

The company scores its test results in 21 centers across the country. The one in San Antonio is the largest.

The building is located in an office park right off a long stretch of highway northeast of the city.

Inside the cavernous 58,000-square-foot space, folding tables connected end to end fill the room. About 100 scorers from all walks of life work in silence, two per table.

They're here scoring PARCC tests — exams aligned with the Common Core standards and designed to replace old state tests. PARCC, short for Partnership for Assessment of Readiness for College and Careers, is one of two testing consortia that have helped states develop these new exams with $360 million in federal funding.

More than 5 million elementary and high school students took the PARCC test this school year in math, reading and writing.

In the facility, scorers' eyes are glued to computer screens displaying students' work from 10 states and the District of Columbia. These states belong to a consortium that pays Pearson $129 million to write the test questions and score the results.

Donna Vickers, a retired elementary school teacher who has worked for Pearson for eight years now, says the writing portion of this test must be scored by humans, not machines.

"I'm scoring third-grade compositions, probably four questions out of maybe 50 questions, a very small portion of an entire test," Vickers says.

She looks for evidence that students understood what they read, that their writing is coherent and that they used proper grammar. But it's actually not up to Vickers to decide what score a student deserves. Instead, she relies on a three-ring binder filled with "anchor papers" — samples of students' writing that show what a low-scoring or a high-scoring response looks like.

"I compare the composition to the anchors and see which score does the composition match more closely," Vickers says.

Pearson does not allow reporters to describe or provide examples of what students wrote because otherwise, company officials say, everybody would know what's on the test.

So here's a writing exercise Pearson did approve:

It's from a book titled Eliza's Cherry Trees: Japan's Gift to America. The task is for third-graders to describe how Eliza faced challenges to change something in America. Students must identify the main idea, draw evidence from the text and provide supporting details in what they write.

"Scoring supervisors" then make sure that the final scores are not out of whack with the so-called "true" scores from those anchor papers we mentioned earlier. Speed is also a concern, says Bob Sanders, Pearson's director of performance scoring.

"We monitor to make sure they're not scoring too fast, or too slow."

Sanders says some people need more training than others, but if scorers repeatedly fall short of the company's performance guidelines, they're fired. Since April, this scoring center has let 51 scorers go. They've not been hard to replace, though. Pay isn't bad — $12 to $15 an hour if you include bonuses. People without a four-year college degree need not apply.

Pearson officials say that last year, most of the 14,000 people it hired to score tests had at least one year of teaching experience.

But it's not required. So the job has attracted all kinds of folks. Many are stay-at-home parents and retired military who are allowed to work from home.

Then there are people like the ones I met, a former lawyer, a retired longshoreman and a bouncer who handles crowd control at concerts.

There are also people like Pat Squires, a college professor. She's been scoring tests for Pearson and other companies since 2002. She says some people approach the job thinking that scoring and grading are the same. They're not.

"In grading, oftentimes what we're doing is looking at what a student is doing and maybe marking them off for what they're not doing correctly," Squires says.

"But in scoring, you're working against a standard and you're looking to see what students have done correctly."

Still, some scorers point out that what test-makers say a third- or fifth-grader should be able to do sometimes doesn't seem right.

"We don't know how they decided whether this is a third-grade capable response," says Joe Bowker.

He did student-teaching in college and has been scoring tests on and off for several years.

"You have to leave your opinion outside the door," he says.

David Connerty-Marin, a spokesman for PARCC, says it's not up to a scorer or Pearson or PARCC to say, "Gee, we think this is too hard for a fourth-grader."

What is or is not developmentally appropriate, he says, is not an issue because the states have already made that decision based on the Common Core Standards.

"The states, with lots of educators, have reviewed the material and said, 'This is appropriate or not appropriate to the standards,' " says Connerty-Marin. "Our job is to write the test questions that measure whether the student is meeting those standards."

This week Pearson is supposed to wrap up its work on this batch of reading and writing tests. The client states will then get the raw scores, and together they must all agree on the same cut scores to determine which students are at grade level and which ones are not.

Andrew Thompson, the Pearson official who oversees the delivery of these raw scores to the states, says the crucial question is this: Will educators, parents and the public at large trust the results?

"They don't know what we're doing, so there's a lot of misconception about what we do," he says, "and we don't have a way right now to refute that [misconception] and show this is really what we're doing."

Most Americans have been in the dark, says Thompson. So the risk for Pearson, PARCC and the states is that by trying to be more transparent this late in the game, people may very well end up with more questions than answers.

Copyright 2015 NPR. To see more, visit http://www.npr.org/.

Transcript

ROBERT SIEGEL, HOST:

Millions of students took new tests this past school year as part of the Common Core Standards that most states have adopted. There's been a lot of suspicion and uncertainty over those tests, though. When it comes to the written or essay portions of the exams, two big questions are who scores all those answers? And can parents and students trust the results? Well, NPR's Claudio Sanchez went to San Antonio, Texas where one of the biggest testing companies offered a behind-the-scenes look at its operation.

CLAUDIO SANCHEZ, BYLINE: Off a long stretch of highway in a sprawling, bland office park just northeast of San Antonio, Texas, the British-owned publishing giant Pearson runs the nation's largest test scoring center. Inside, folding tables connected end-to-end fill the cavernous space.

JIM LOGAN: We have roughly 58,000 square feet of scoring space.

SANCHEZ: That's Jim Logan.

LOGAN: I'm the operations manager, responsible for the physical facility. And we have approximately 100 scorers here today.

SANCHEZ: Two scorers per table work in silence, their eyes glued to computer screens. For two months now, these people from all walks of life have been scoring a new kind of reading and writing test developed with federal funds. Now it's no secret that these tests aligned with the Common Core Standards will replace old state exams. States are paying Pearson millions of dollars to write the test questions and score the responses. Another organization, Partnership for Assessment of Readiness for College and Careers or PARCC, manages the agreement on behalf of 10 states and the District of Columbia. Their reading and writing tests are being scored at this facility - not by machines but by humans.

VICKERS: I'm scoring third grade compositions - probably 4 questions out of maybe 50 questions, a very small portion of an entire test.

SANCHEZ: Donna Vickers is a retired elementary school teacher.

VICKERS: I've been here at Pearson doing work like this - this is my eighth year. I work almost year-round.

SANCHEZ: When scoring, Vickers looks for evidence that students understood what they read, that their writing is coherent and that they use proper grammar. And yet, it's not up to Vickers to decide on her own what score a student deserves. Instead, she relies on a three-ring binder filled with so-called anchor papers, samples of students' writing that show what a low score or a high score response looks like.

VICKERS: And I compare the composition to the anchors and see which score does the composition match more closely.

SANCHEZ: Now eight and nine-year-olds can write some pretty interesting stuff, but the nondisclosure agreement I had to sign keeps me from telling you anything about what they wrote - nothing. Otherwise, says Pearson, everybody would know what's on the test. So here's a sample task company officials approved from a book titled "Eliza's Cherry Trees: Japan's Gift To America." It asks third graders to describe how Eliza faced challenges to change something in America. Students must identify the main idea, draw evidence from the text and provide supporting details in what they write. A scoring supervisor then makes sure the final scores are not out-of-whack with the so-called true scores from those anchor papers we mentioned earlier. There is something else, though, that supervisors check.

BOB SANDERS: Speed is a concern of ours, so we monitor their scoring rates to make sure they're not scoring too fast or too slow.

SANCHEZ: Bob Sanders, Pearson's director of performance scoring, says everybody should be working at the same pace. Some people need more training than others, but if scorers repeatedly fall short of the company's performance guidelines...

SANDERS: They know they're going to get fired. Just because someone can write doesn't mean they can evaluate a student's response.

SANCHEZ: The 51 scorers Pearson has let go since April were not hard to replace. Pay isn't bad - $12 to $15 an hour, if you include bonuses. People without a four-year college degree need not apply. Last year most of the 14,000 people Pearson hired to score tests had at least one year of teaching experience, but it's not required. So the job has attracted all kinds of folks - stay-at-home parents and retired military who are often allowed to work from home, or people like the ones I met - a former lawyer, a retired longshoreman, a bouncer who handles crowd control at concerts and some like Pat Squires, who has been scoring tests for Pearson and other companies since 2002.

PAT SQUIRES: Well, I'm also a college professor.

SANCHEZ: Squires says some people approach the job thinking that scoring and grading are the same. They're not.

SQUIRES: In grading oftentimes what we're doing is we're taking a look at what a student is doing and maybe marking them off for what they're not doing correctly. But in scoring, you're working against a standard, and you're really looking to see what is it that students have done correctly?

SANCHEZ: Still, as some scorers point out, what test makers say a third or fifth grader should know and be able to do sometimes just doesn't seem right.

JOE BOWKER: You'll see certain things that, whether you agree with it or disagree, they're the ones that have made the rules for you to follow.

SANCHEZ: Joe Bowker did student teaching in college and has been scoring tests on and off for several years.

BOWKER: We don't know how they've decided whether this is a third grade capable response. You have to leave your opinion outside the door.

DAVID CONNERTY-MARIN: It's not up to a scorer or Pearson or PARCC to say do we think this is too hard for a fourth grader?

SANCHEZ: That's David Connerty-Marin, a spokesman for PARCC. He says what's developmentally appropriate is not an issue because the states that PARCC and Pearson work for have already made the decision based on the Common Core Standards.

CONNERTY-MARIN: The states have - with lots of educators - have reviewed the material and said this is appropriate to the standards or this is not appropriate to the standards. Our job is to write test questions that accurately measure whether the student is meeting those standards.

SANCHEZ: This week Pearson is supposed to wrap up its work on this batch of tests. The client states will then get the raw scores and together they must agree on the same cut scores to determine which students are reading and writing at grade level, which ones are not. Andrew Thompson, who oversees the delivery of these raw scores to the states, says the crucial question is this - will educators, parents and the public at large trust the results?

THOMPSON: They don't know what we're doing, so there's a lot of misconception of what we do. And we don't have a way right now to refute that, and show, no, this is really what we're doing.

SANCHEZ: Thus far, Thompson says, most Americans have been in the dark. So the risk for PARCC, Pearson and the states is that by trying to be more transparent this late in the game, we'll end up with more questions than answers. Claudio Sanchez, NPR News. Transcript provided by NPR, Copyright NPR.