What are Microsoft, Apple, Amazon, and Nuance up to in Cambridge? Suddenly, the four tech giants are scooping up every speech recognition expert they can find and plunking them down in offices in Kendall and Central Square.
It’s not because it’s tough to identify the words you’re speaking to a mobile device. In fact, I’ve been dictating the opening of this column to my iPhone, including punctuation, and you can see that the accuracy ain’t bad. (I corrected it only on the company name Nuance, which it repeatedly heard as “new wants.”)
Instead, these four players are focused on new frontiers. One is putting speech into new kinds of devices, like television remote controls, cars, and perhaps future wearable computers like wristbands, earpieces, or glasses. Another is developing the ability to understand not just the words you’re saying, but what you actually want.
“More and more, you want your devices to be personalized — to understand the context you’re in, both geographic and based on what you’ve done in the past,” says TJ Hazen, a speech scientist hired in January by Microsoft’s New England Research and Development Center.
(OK: I stopped dictating toward the end of that last paragraph, because speech recognition is still utterly at sea when it comes to proper names. Hazen’s name was “TJ Hayes and,” and it also interpreted earpieces as “your pieces.”)
Cambridge didn’t become a nexus of speech activity overnight. There has been “a quiet building” over the past four decades or so, in the words of Peter Mahoney, Nuance’s chief marketing officer. Labs at MIT and research at places like BBN Technologies, now part of Raytheon, have chiseled away at the incredibly difficult problems of recognizing the words you’re speaking, and trying to intuit what you mean by them.
John Makhoul wrote a doctoral thesis on speech recognition at MIT in the 1960s, and then went to work at BBN in 1970 at the age of 28. Much of BBN’s funding came from the Department of Defense’s advanced research arm. In the 1970s, Makhoul recalls, “It took an hour using a mainframe from Digital Equipment to recognize one sentence.” And that system recognized only 1,000 words.
All that academic and military-funded research spawned companies like Kurzweil Applied Intelligence, Dragon Systems, SpeechWorks, and Wildfire Communications, which brought speech recognition to personal computers and telephone call centers. Many of the smaller firms were eventually acquired by Nuance, a publicly traded company in Burlington; it sells speech-recognition software to customers like Apple, Amazon, and BMW, and offers its own products like the Dragon Mobile Assistant for Android smartphones.
What’s everyone working on now? Dan Miller of Opus Research says “the holy grail” is a personal assistant that recognizes you quickly, understands your preferences, your privileges, and where you are. Apple’s Siri personal assistant, introduced in 2011 and built with technology from Nuance, “raised the level of attention around speech interfaces in a big way,” says Mike Phillips, a serial entrepreneur in the speech industry.
But plenty of people have mocked Siri’s shortcomings. Layering on contextual information — like where your next meeting is, or what restaurants you’ve liked in the past — will help the speech-driven systems get smarter. “What’s the best place for dinner near my next meeting?” is the kind of question they’ll be able to answer.
The sudden emphasis on upgrading speech capabilities in various devices means that speech gurus in Cambridge now “have more choices of where they want to work,” says Hazen. It also means that salaries are skyrocketing; a newly minted PhD can start in the six figures, and top dogs can earn close to $200,000.
Next month, Nuance plans to open a research and development office for about 120 employees in Central Square. Mahoney says he expects it to grow to 175 in the near term. “A certain element of the technical crowd likes to be close to where other great technical minds are,” Mahoney says. “Cambridge has that gravitational force, and the university heritage.”
When Amazon began building its stealthy Cambridge R&D outpost in late 2011, one of the two leaders it hired was Bill Barton. His LinkedIn profile says he is “leading development of speech and language solutions which will enhance user interactions with Amazon products and services.” And Amazon also acquired a start-up called Yap, based in North Carolina, late that year. Yap’s head of research, Jeff Adams, is a veteran of Nuance and Kurzweil Applied Intelligence who lives in the Boston area. He now works for Amazon, according to LinkedIn. (Amazon didn’t respond to my requests for interviews.)
Apple’s Cambridge group is the smallest of the four, with just a handful of people working in an office at the Cambridge Innovation Center. The group includes three Nuance alums working on improvements to Apple’s Siri, including Larry Gillick, the chief speech scientist for Siri. Apple also didn’t respond to interview requests.
Understanding what we mean when we talk is one of those thorny technical problems that Boston loves to tackle. We’ve been throwing our smartest people at it for more than 40 years now, and as speech recognition becomes a differentiator for the tech giants’ newest products, they have little choice but to hire talent here.
Today, Makhoul still works at BBN, where he is the chief scientist. His office is in the same building where he worked when he arrived in 1970.
He says that these days, BBN’s speech scientists are interested in challenges like understanding people who use slang, have an accent, or talk in noisy environments. And, he adds, “real understanding of what you mean is still a long way off.”
I asked Makhoul whether he is surprised at how long it has taken to sculpt speech recognition software into something useful. “I was one of the pessimists,” he said. “I’m pleased that in 40 years we’ve come this far.”