MIT’s New AI Can (Sort of) Fool Humans With Sound Effects
Over at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), a team of six researchers created a machine-learning system that matches sound effects to video clips. Before you get too excited, the CSAIL algorithm can’t do its audio work on any old video, and the sound effects it produces are limited. For the project, CSAIL PhD student Andrew Owens and postgrad Phillip Isola recorded videos of themselves whacking a bunch of things with drumsticks: stumps, tables, chairs, puddles, banisters, dead leaves, the dirty ground.
The team fed that initial batch of 1,000 videos through its AI algorithm. By analyzing the physical appearance of objects in the videos, the movement of each drumstick, and the resulting sounds, the computer was able to learn connections between physical objects and the sounds they make when struck. Then, by “watching” different videos of objects being whacked, tapped, and rubbed by drumsticks, the system was able to calculate the appropriate pitch, volume, and aural properties of the audio that should accompany each clip.
The algorithm doesn’t make its own sounds; it just pulls from a database of tens of thousands of audio clips. Also, sound effects aren’t selected based on visual matches; as you can see around the 1:20 mark of the video above, the algorithm gets creative. It selected sound effects as varied as a rustling plastic container and a smacked stump for a sequence in which a shrub gets a thorough drumsticking.
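As a rough illustration of that lookup step, here is a minimal sketch that picks the stored clip whose audio features sit closest to the features predicted for a video. It is not the CSAIL implementation: the `Clip` type, the three-number feature vectors (pitch, volume, brightness, each scaled to 0..1), the clip names, and the use of plain Euclidean distance are all assumptions made for the example.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Clip(NamedTuple):
    """A stored sound effect (hypothetical structure for this sketch)."""
    name: str
    # Assumed features: (pitch, volume, brightness), each scaled to 0..1.
    features: Tuple[float, float, float]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_clip(predicted: Sequence[float], database: Sequence[Clip]) -> Clip:
    """Return the clip whose features are closest to the predicted ones."""
    if not database:
        raise ValueError("database is empty")
    return min(database, key=lambda clip: distance(predicted, clip.features))


# A toy stand-in for the database of recorded clips (made-up values).
DATABASE = [
    Clip("rustling_plastic", (0.80, 0.30, 0.90)),
    Clip("smacked_stump", (0.20, 0.85, 0.25)),
    Clip("tapped_table", (0.55, 0.60, 0.50)),
]

if __name__ == "__main__":
    # Features a model might predict for one drumstick hit (made-up values).
    predicted = (0.75, 0.35, 0.80)
    print(nearest_clip(predicted, DATABASE).name)
```

Because the match is made on sound features rather than on what the object looks like, a lookup of this kind can return a plastic-container rustle for a shrub if the predicted audio happens to be closest to that clip.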
Owens says the research team used a convolutional neural network to analyze video frames and a recurrent neural network to pick the audio for them. They leaned heavily on the Caffe deep-learning framework, and the project was funded by the National Science Foundation and Shell. One of the team members works for Google Research, and Owens was part of the Microsoft Research fellowship program.
“We’re largely applying existing techniques in deep learning to a new domain,” Owens says. “Our goal isn’t to develop new deep learning methods.”
Matching realistic sounds to video has primarily been the domain of Foley artists, the post-production audio wizards who record the footsteps, door creaks, and flying roundhouse kicks you see (and hear) in a polished Hollywood movie. A skilled Foley artist can make a sound that precisely matches the visual, fooling the viewer into thinking that the audio was captured on the set.
MIT’s bot isn’t nearly that good. The research team conducted an online survey in which 400 participants were shown versions of the same video with the original audio and the algorithm-generated sounds, then asked to pick which video had the real sounds. The fake audio was selected 22 percent of the time, very far from perfect but still twice as effective as a previous version of the algorithm.
According to Owens, those test results are a good sign that the computer-vision algorithm can see the materials an object is made of, as well as the different physics of tapping, whacking, and scraping an object. Still, certain things tripped the system up. Sometimes it thought the drumstick was striking an object when it actually wasn’t, and more people were fooled by its sound effects for leaves and dirt than by its sound effects for more solid objects.
There’s a deeper reason behind the project beyond just making fun sound effects. If perfected, Owens thinks the computer-vision tech could help robots identify the materials and physical properties of an object by analyzing the sounds it makes. “We’d like these algorithms to learn by watching this physical interaction occur and observing the response,” Owens says. “Think of it as a toy version of learning about the world the way that newborns do, by banging, stomping, and playing with things.”