
One small step: Robots learn to predict sounds from sights


Here’s a fun game: Tap your finger against a surface, but before you do, predict the sound it will make. Did you get it right? If not, you’d better practice, because pretty soon a robot’s going to play this game better than you can.

On the march to the robot apocalypse, the ability to perform such a quirky task may not seem especially portentous, but new research out of MIT demonstrates why such a capacity lays the foundation for far more sophisticated actions.

Andrew Owens, a PhD student in MIT’s Computer Science and Artificial Intelligence Laboratory, and his collaborators presented the research at a conference in Las Vegas last month. There they explained how they’d engineered a computer algorithm that can “watch” silent video of a drumstick striking different kinds of objects and create a sound that, in many cases, closely matches the one generated by the actual event.

“The computer has to know a lot about the physical world,” says Owens. “It has to know what material you’re hitting, that a cushion is different than a rock or grass. It also has to know something about the action you’re taking. If you’re [striking the surface] hard, it should be a loud sound; if soft, a softer sound.”


To prepare the algorithm, the MIT researchers fed it 1,000 videos containing 46,000 distinct sounds made when a drumstick strikes (or scrapes, or prods) different kinds of objects. The computer analyzed those sounds while remembering the visual images associated with them. Then, when confronted with a new video, played on mute, the algorithm searches its inventory of visual-audio combinations and pulls together a sound that it thinks best fits what it’s observed.
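
For readers who think in code, the matching step described above can be pictured as a nearest-neighbor lookup. What follows is only a toy sketch under loud assumptions, not the researchers’ actual method: the two-number “visual features” (surface hardness and strike force), the four-entry inventory, and the function name predict_sound are all invented here for illustration.

```python
import math

# Hypothetical inventory: each entry pairs made-up visual features
# (surface hardness, strike force), both on a 0-to-1 scale, with a
# label standing in for the recorded sound. None of this is real data.
INVENTORY = [
    ((0.9, 0.8), "loud rock clack"),
    ((0.9, 0.2), "soft rock tap"),
    ((0.1, 0.8), "loud cushion thump"),
    ((0.1, 0.2), "soft cushion pat"),
]


def predict_sound(visual_features):
    """Return the stored sound whose visual features lie closest to the query."""
    nearest = min(
        INVENTORY,
        key=lambda entry: math.dist(entry[0], visual_features),
    )
    return nearest[1]


# A silent clip showing a hard surface struck hard.
print(predict_sound((0.85, 0.75)))
```

In this sketch, a silent clip is boiled down to the same two numbers and matched to whichever stored example sits closest. The real system would need far richer descriptions of what it sees and hears, but the “look it up in the inventory” idea is the same.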

So far, progress is uneven. The program is better at modeling outdoor sounds, where it can effectively distinguish a hard surface like a rock from a soft one like dirt. It struggles with indoor noises.


“It has some trouble with indoor hard surfaces. It’s not good at distinguishing plastic from wood from metal,” says Owens.

The ability to perform this kind of task is more than a party game. It’s a skill that could make robots useful on construction sites, where tapping an object, and registering the sound it makes, is a technique used all the time to get a sense of what something’s made of.

“If you don’t know what material something is by looking at it, what’s the next step? You tap the walls, listen to the sound, see if it’s hollow or not hollow,” says Abhinav Gupta, a professor of robotics at Carnegie Mellon University.

More generally, the kind of ability demonstrated in this new MIT research marks an important step toward the realization of a major goal in robotics. To successfully navigate the world, robots will have to draw on all their senses at once, combining information gained through touch, vision, and hearing in order to make sense of what’s in front of them and figure out how to proceed. That, after all, is how people do it. But 40 years ago, in the early days, engineers found it too complicated to work on all these sense modalities at once, so the field splintered, with research on each sense proceeding in isolation from the others.

“The hope was everyone would solve their own problems, and then we’d come back together,” says Gupta.


Now that’s beginning to happen. The MIT research is one of a few examples of research that combines different senses. Other advances include Gupta’s work that combines touch and vision. And while it may seem charming to have a robot that can hear and see in order to create the soundtrack for a drumstick hitting a table, that’s not where this is likely to end. Imagine, some years hence, a later version of this algorithm programmed into a nimble walking robot: It hears a creak in the floorboards, sees a door ajar and knows you’re hiding in the closet.

Kevin Hartnett is a writer in South Carolina. He can be reached at kshartnett18@gmail.com.