Mass. ponders hiring a computer to grade MCAS essays. What could go wrong?
Each year, students taking the state's MCAS exams generate more than 6 million essays and other written responses, requiring a small army of graduate students, educators, and other professionals to read and score them — a labor-intensive task that eats up much of the summer.
Now, in an effort to speed up the delivery of MCAS results to schools and families, the state Department of Elementary and Secondary Education is exploring the idea of replacing human test scorers with a computer program.
That's right: Students would be producing a response that may never be read by anyone, denying them the chance to stir a reader's emotion, draw a laugh, or sway an opinion.
Instead, the responses would be processed by an algorithm that evaluates such elements as sentence structure, word choice, and length. Whether it would detect factual errors is still a matter of debate.
Many educators worry that using automated essay scoring could be unfair to students who often agonize over every word.
"Do they value kids' work in writing or not?" said Michael Maguire, a teacher at Boston Latin Academy. "If you want students to write, then you have to have a human sit down and read it."
Jeffrey Riley, commissioner of elementary and secondary education, stressed the department has not formally proposed the change.
But, he said, "if there is a technology out there that can help us, it's incumbent on us to look at it."
He said the technology could help the state deliver full sets of MCAS scores to districts in the summer instead of the fall. That, in turn, would enable districts to analyze the results and adjust instruction before the school year begins.
The state Board of Elementary and Secondary Education plans to discuss the idea at its meeting Tuesday in Marblehead.
Automated scoring, which dates back to the 1960s, often includes a human element. The programs sort through hundreds of human-scored essays from previous standardized tests to create a model that predicts how a human would score future essays.
That model considers the complexity, or simplicity, of the words and sentence structures, and even the grouping of certain words that have appeared in previous essays.
In some cases, the programs will assign a confidence value to the score being generated. For example, if the writing patterns are not sufficiently similar to the archived essays, then it would assign a low confidence value, signaling that a person should review the essay.
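To make the idea concrete, here is a minimal sketch in Python of how a program might borrow a score from an archive of human-scored essays and attach a confidence value. It is an illustration only, not the method of any vendor named in this article: the three surface features, the nearest-match lookup, the `1 / (1 + distance)` confidence formula, and the `0.5` threshold are all assumptions chosen for simplicity.

```python
import math
import re


def features(text):
    """Crude surface features: word count, mean word length, mean sentence length."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0, 0.0)
    return (
        float(len(words)),
        sum(len(w) for w in words) / len(words),
        len(words) / len(sentences),
    )


def score_with_confidence(text, archive, threshold=0.5):
    """Return (score, confidence, needs_human) for one essay.

    archive is a list of (essay_text, human_score) pairs. The essay gets the
    score of its closest archived match; the confidence shrinks as the
    distance to that match grows (hypothetical formula and threshold).
    """
    target = features(text)
    best_score, best_dist = None, math.inf
    for essay, human_score in archive:
        dist = math.dist(target, features(essay))
        if dist < best_dist:
            best_score, best_dist = human_score, dist
    confidence = 1.0 / (1.0 + best_dist)
    return best_score, confidence, confidence < threshold
```

An essay that closely resembles an archived one gets a confidence near 1.0, while an essay unlike anything in the archive falls below the threshold and is marked for a person to read.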
A growing body of research has suggested little variation between scores issued by computer programs and human beings. In some cases, that research has been conducted in conjunction with Pearson and other companies that have automated scoring products.
"When used responsibly and carefully, automated essay scoring is faster and cheaper and can even exceed the validity of human scores," said Jon Cohen, executive vice president at American Institutes for Research and president of AIR Assessment, which provides automated essay scoring in standardized testing. "One of the big challenges in human essay scoring is to get [written responses] scored reliably."
That's because there is an element of subjectivity in judging writing — what appeals to one person might not appeal as strongly to another. Cohen said testing companies remedy this by training people who score written responses to adhere strongly to a set of detailed standards for each scoring category.
But automated scoring has drawn many vocal detractors. Les Perelman, a researcher and retired Massachusetts Institute of Technology writing professor, argues that automated scoring systems are not only inaccurate but are detrimental to writing instruction. In an era of teaching to the test, he said, teachers will drill strategies into their students to game the computer programs to get higher scores.
To test just how unreliable automated scoring systems can be, Perelman and a small group of MIT and Harvard graduate students four years ago generated an essay of gibberish that included obscure words and complex sentences and ran it through a computerized scoring system used for a graduate school admission exam. The nonsensical essay achieved a high score — on the first try.
"We thought it would catch us the first time and it would take some reiteration," he said. "It just shows how rudimentary the system is. It counts the most basic things. Computers are not aware of meaning."
In another study, Perelman tested the accuracy of an automated grammar check with a famous speech by Noam Chomsky, "The Responsibility of Intellectuals." The grammar check erroneously flagged several grammatical errors.
Across the nation, interest in automated scoring has grown as states have been moving their testing systems from paper booklets to the Internet, opening the possibility of expanding computerized scoring beyond multiple choice questions.
Few states, though, have adopted the technology, according to testing experts, as the practice remains highly controversial.
Earlier this year, Ohio education officials faced a public backlash after they revealed they had quietly implemented automated scoring for student writing on standardized tests. The issue came to light after several districts spotted irregularities in the results, according to media reports.
Fully aware of the skepticism surrounding the technology, Utah trod carefully when it adopted automated scoring as part of a new standardized testing system during the 2014-15 school year, using hand scorers to verify the results for the first two years.
"Overall, I've been impressed with how closely our automatic scoring matches up with teacher scoring," said Cydnee Carter, Utah's assessment development coordinator, noting that about 20 percent of the essays continue to be reviewed by people.
Utah has instituted other measures to ensure quality control. For example, the computer program, which uses a confidence index for its scores, will flag any student who receives a high score in reading but a low score in writing.
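A check of that kind can be expressed in a few lines. The Python sketch below is hypothetical: the article does not say how Utah defines a "high" or "low" score, so the cutoffs here (scores expressed as a fraction of the maximum, with 0.8 and 0.4 as limits) are assumptions for illustration.

```python
def flag_mismatch(reading, writing, high=0.8, low=0.4):
    """Flag a student whose reading score is high but whose writing score is low.

    Both scores are fractions of the maximum possible (0.0 to 1.0).
    The cutoffs are illustrative assumptions, not Utah's actual values.
    """
    return reading >= high and writing <= low
```

For example, `flag_mismatch(0.9, 0.3)` marks the student for review, while `flag_mismatch(0.9, 0.7)` does not.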
Teachers also can file a grievance with the state if students receive a lower mark than they typically earn in class. Carter said in most cases the automated score holds up under review.
There have been some glitches, however, she said. Officials had to readjust the program to spot gibberish as well as students who were quoting too heavily from reading passages they were writing about, both of which were artificially inflating scores.
Some students also were attempting to game the computer program by writing one spectacular paragraph and then repeating it over and over again. The computer program now picks up on that.
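Catching that trick does not require sophisticated software. As a rough sketch, and assuming paragraphs are separated by blank lines, the Python function below measures what share of an essay's paragraphs are exact repeats of an earlier one. The function name and the exact-match rule are assumptions for illustration; the article does not describe how Utah's program does it.

```python
from collections import Counter


def repeated_paragraph_ratio(text):
    """Return the fraction of paragraphs that duplicate an earlier paragraph.

    Paragraphs are split on blank lines and compared after lowercasing and
    collapsing whitespace, so only exact repeats are counted.
    """
    paragraphs = [" ".join(p.lower().split()) for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    counts = Counter(paragraphs)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(paragraphs)
```

An essay made of one paragraph pasted three times yields a ratio of two-thirds, which a scoring program could treat as grounds for a lower score or a human review.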
Transparency, she said, is key.
"Giving teachers and students an opportunity to understand how the essays are being scored ultimately helps students become skilled writers," Carter said. "We don't want students gaming the system and not show what they know."
Massachusetts officials have been toying with the idea of automated scoring of essays since at least 2016. A work group of state administrators and local educators that examined changes to the MCAS exams was split on the idea and recommended sharing information about it with local school systems, according to a summary of the group's discussion.
If Massachusetts pursues automated scoring, state officials said they would likely tap human beings to spot-check the results.
"Personally I can't imagine going into any automation that doesn't have multiple levels of backup and quality control," said Jeff Wulfson, deputy education commissioner.