Ideas

BRAINIAC

Teaching computers to understand non-native English

A student holds a Portuguese-English dictionary.
The Boston Globe

At the end of July, researchers from the Center for Brains, Minds and Machines at MIT released a major new resource for the study of an often overlooked variety of language: English spoken by non-native speakers. It comes in the form of a database called the Treebank of Learner English that catalogues all the grammatical idiosyncrasies found in 5,000 English-language sentences written by people who don’t speak English as their first language.

The creators of the Treebank anticipate it will provide a platform for the study of learner English and make it easier to develop technology, such as better search engines, that supports non-native speakers. “Most people in the world that speak English speak English as a second language, yet at the same time, when we look at the scientific study of the English language, it’s mostly based on text produced by native speakers,” says Yevgeni Berzak, a graduate student in electrical engineering and computer science who led the project.

To study a language, it’s useful to have what linguists call a “treebank,” a collection of sentences that have been annotated to describe the language’s basic rules and grammatical constructions. These annotations are similar in spirit to the way kids are taught to diagram sentences in school, and when collected together, they give researchers a rich dataset with which to study how languages work.

The Treebank of Learner English uses an annotation scheme called Universal Dependencies. It is a favorite among computational linguists because it provides enough structure to reveal interesting patterns, while also being flexible enough to be applied to any language.
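To give a feel for what such an annotation looks like, here is a minimal sketch in Python. It assumes the tab-separated CoNLL-U layout that Universal Dependencies data is usually distributed in. The sentence “She reads books” is a hand-written illustration, not an entry from the Treebank, and the helper names `parse_conllu` and `describe` are made up for this example.

```python
# Illustrative only: a hand-written sentence, not taken from the Treebank.
# The ten columns follow the CoNLL-U layout used by Universal Dependencies:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
CONLLU = "\n".join([
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_",
])


def parse_conllu(text):
    """Turn CoNLL-U lines into a list of token dictionaries."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),   # 0 means "this word is the root"
            "deprel": cols[7],
        })
    return tokens


def describe(tokens):
    """Render each word as relation(head, dependent)."""
    forms = {t["id"]: t["form"] for t in tokens}
    forms[0] = "ROOT"
    return [f'{t["deprel"]}({forms[t["head"]]}, {t["form"]})' for t in tokens]


for relation in describe(parse_conllu(CONLLU)):
    print(relation)
```

Each word points to the word it depends on, and the label names the relationship: “She” is the subject of “reads,” and “books” is its object. Because the scheme only needs heads and labels, the same format can be applied to sentences in any language.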

Users can search the sentences in the Treebank by keyword, the writer’s native language, and specific kinds of grammatical errors, like missing adverbs or unnecessary pronouns. A search for “T.V.,” for instance, returns this sentence by a native Russian speaker: “Evry [sic] family has now T.V…” The annotations on the sentence capture the fact that the writer uses a word order that would be correct in Russian but is incorrect in English (“has now”), and also omits the article “a.” Another search comes back with “I like very much the mountains,” which is a correct construction in the writer’s native French, but wrong in English.

These kinds of carry-over effects are common. “By the time you’re heading into your teens and acquiring a second language, there’s always a residue effect from your first language,” says Christopher Manning, a computational linguist at Stanford University.

By cataloging those residue effects, the Treebank researchers hope to achieve two ends. First, they want to provide a dataset that can be used for the scientific study of “learner English.” Second, they expect it could help computer programmers make the Internet a more accommodating place for non-native English speakers.

A growing number of computer applications rely on automatic interpretation of text entered by users: you type (or say) a query and the software tries to determine what you mean by it. These include question answering, translation between languages, information extraction, conversational agents, and grammar checking in word processing programs. All these applications use a technique called machine learning, which draws on a collection of standard English sentences to guess what you might be after. It works less well, though, when the user is a non-native English speaker using non-standard grammatical constructions. This means that many computer applications aren’t well-tuned to the varieties of learner English they increasingly have to process.
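A toy example shows why training data matters. The Python sketch below is a deliberately crude stand-in for machine learning, not how real systems work: it simply checks what fraction of a sentence’s adjacent word pairs also appear in a handful of “standard English” sentences. Those training sentences are invented for illustration, and the learner sentences from the Treebank are lowercased with spelling and punctuation normalized.

```python
from collections import Counter

# Hypothetical "standard English" training sentences, invented for this sketch.
TRAINING = [
    "every family has a tv now",
    "every family now has a tv",
    "i like the mountains very much",
    "i like the sea very much",
]


def bigrams(sentence):
    """Return the adjacent word pairs in a sentence."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))


# Count every word pair that occurs in the training sentences.
SEEN = Counter(pair for s in TRAINING for pair in bigrams(s))


def coverage(sentence):
    """Fraction of the sentence's word pairs that occur in the training data."""
    pairs = bigrams(sentence)
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in SEEN) / len(pairs)


for query in ["every family now has a tv",
              "every family has now tv",
              "i like very much the mountains"]:
    print(f"{coverage(query):.2f}  {query}")
```

The standard phrasing is fully covered by the training data, while the learner versions contain word pairs such as “has now” that the model has never seen. A system trained only on standard text has less to go on when it meets them, which is the gap that annotated learner data could help close.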

“Learner English has become a language of communication, so that’s something you want to successfully process if you’re interested in doing any kinds of technological representations of language,” says Manning.

In other words, while Google touts its facility with dozens of world languages, to really be a universal tool, it might need a better understanding of the dozens of different ways that non-native speakers use English.

Kevin Hartnett is a writer in South Carolina. He can be reached at kshartnett18@gmail.com.