Modelling the world with sound

Imagine a huge stone cathedral whose sanctuary reverberates with the thunderous tones of a pipe organ. The sound a visitor hears is shaped by the location of the organ, where the listener is standing, whether any columns, pews, or other obstructions stand between them, the material of the walls, the placement of windows and doorways, and more. Hearing a sound can help a person visualise the space around them.

Researchers at MIT and the MIT-IBM Watson AI Lab are exploring how spatial acoustic information can also help machines better understand their surroundings. They developed a machine-learning model that captures how any sound in a room propagates through the space, enabling it to simulate what a listener would hear at different locations.

By accurately modelling the acoustics of a scene, the system can learn the underlying 3D geometry of a room from sound recordings. The researchers can then use the acoustic information their system captures to build accurate visual renderings of the room, much as people use sound to infer the properties of their physical environment.

Beyond its potential applications in virtual and augmented reality, the technique could help artificial-intelligence agents develop richer understandings of the world around them. For instance, according to Yilun Du, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model, an underwater exploration robot could sense things that are farther away by modelling the acoustic properties of the sound in its environment.

“Most studies to date have concentrated solely on modelling vision. But human perception is multimodal: not only is vision important, so is sound. I think this work opens up an exciting research direction on better using sound to model the world,” says Du.

Du wrote the paper with lead author Andrew Luo, a graduate student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Science at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT’s Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Professor of Electrical Engineering and Computer Science and a member of CSAIL; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The study will be presented at the Conference on Neural Information Processing Systems.

Aural and visual

In computer vision research, a type of machine-learning model called an implicit neural representation has been used to generate continuous, smooth reconstructions of 3D scenes from images. These models use neural networks, which contain layers of interconnected nodes, or neurons, that process data to perform a task. The MIT researchers used the same type of model to capture how sound travels continuously through a scene.
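To make the idea concrete, the sketch below shows the basic shape of an implicit neural representation: a small multilayer perceptron that maps a continuous coordinate to a signal value, so the scene is stored in the network’s weights rather than in a grid of samples. This is an illustrative PyTorch sketch, not the researchers’ code; the class name, layer sizes, and output dimension are assumptions.

```python
# Illustrative sketch of an implicit neural representation (not the paper's code):
# a small MLP maps continuous 3D coordinates to a signal value, so the scene is
# encoded in the network weights and can be queried at any point.
import torch
import torch.nn as nn

class ImplicitField(nn.Module):
    def __init__(self, in_dim=3, hidden=256, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        # coords: (N, 3) continuous positions in the scene; one value per point.
        return self.net(coords)

field = ImplicitField()
query = torch.rand(8, 3)      # eight arbitrary query points
print(field(query).shape)     # torch.Size([8, 1])
```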

However, they found that vision models benefit from a property called photometric consistency that does not carry over to sound: the same object looks roughly the same when viewed from two different locations. With sound, by contrast, moving to a different spot can produce an entirely different signal because of obstructions, distance, and so on, which makes predicting audio much harder. To overcome this, the researchers built two acoustic properties into their model: the reciprocal nature of sound and the influence of local geometric features.

Sound is reciprocal: if the source and the listener swap positions, what the listener hears is unchanged. In addition, local features, such as an obstruction between the listener and the sound source, strongly affect what one hears at a particular spot. To account for these two factors in their model, known as a neural acoustic field (NAF), the researchers augment the neural network with a grid that captures objects and architectural features in the scene, such as doorways or walls. The model randomly samples points on that grid to learn the features at specific locations.
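The sketch below illustrates, under stated assumptions, how such a model might be wired together: an MLP conditioned on the emitter and listener positions plus local features sampled from a learned 2D grid covering the floor plan, predicting a piece of the acoustic response. The class name, grid resolution, feature sizes, and output dimension are hypothetical choices for illustration, not the published architecture.

```python
# Hedged sketch of a neural acoustic field: condition an MLP on emitter and
# listener positions plus local features bilinearly sampled from a learned
# 2D feature grid over the floor plan. Shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NAFSketch(nn.Module):
    def __init__(self, grid_res=32, feat_dim=16, hidden=256, out_dim=2):
        super().__init__()
        # Learned feature grid covering the room's layout (walls, doorways, ...).
        self.grid = nn.Parameter(torch.randn(1, feat_dim, grid_res, grid_res))
        self.mlp = nn.Sequential(
            nn.Linear(2 * (2 + feat_dim) + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),   # e.g. one time-frequency bin of the response
        )

    def local_features(self, xy):
        # xy: (N, 2) positions normalised to [-1, 1]; bilinear lookup in the grid.
        g = xy.view(1, -1, 1, 2)
        feats = F.grid_sample(self.grid, g, mode="bilinear", align_corners=True)
        return feats.squeeze(-1).squeeze(0).t()          # (N, feat_dim)

    def forward(self, emitter_xy, listener_xy, freq):
        x = torch.cat([
            emitter_xy, self.local_features(emitter_xy),
            listener_xy, self.local_features(listener_xy),
            freq,
        ], dim=-1)
        return self.mlp(x)

naf = NAFSketch()
emitter = torch.rand(4, 2) * 2 - 1    # emitter positions in [-1, 1]^2
listener = torch.rand(4, 2) * 2 - 1   # listener positions in [-1, 1]^2
freq = torch.rand(4, 1)               # query frequency (illustrative conditioning)
print(naf(emitter, listener, freq).shape)   # torch.Size([4, 2])
```

Reciprocity could be encouraged, for example, by treating the two positions symmetrically, though how the actual model enforces it is not detailed in this article.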

“If you imagine standing near a doorway, what most affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found that this information enables better generalisation than a simple fully connected network,” Luo says.

From predicting sounds to visualising scenes

The researchers feed the NAF visual information about a scene and a few spectrograms that show what an audio clip would sound like when the emitter and listener are located at particular points in the room. The model then predicts what that audio would sound like at any point in the scene a listener could move to.
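As a rough illustration of the kind of training target described above, the snippet below computes a log-magnitude spectrogram from a placeholder recording for one emitter-listener pair using SciPy’s short-time Fourier transform; the sample rate, window length, and random signal are stand-ins, not the paper’s data pipeline.

```python
# Illustrative only: turn a recorded response for one emitter/listener pair
# into a log-magnitude spectrogram of the kind a model could be trained on.
import numpy as np
from scipy.signal import stft

sr = 16000                                      # assumed sample rate in Hz
ir = np.random.randn(sr // 2)                   # placeholder half-second recording
freqs, times, Z = stft(ir, fs=sr, nperseg=256)  # short-time Fourier transform
log_mag = np.log1p(np.abs(Z))                   # log-magnitude spectrogram
print(log_mag.shape)                            # (frequency bins, time frames)
```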

The NAF outputs an impulse response, which captures how a sound should change as it propagates through the environment. The researchers then apply this impulse response to different sounds to hear how they should change as a person walks around a space. For example, if a song is playing from a speaker in the centre of a room, their model would show how that sound gets louder as a person approaches the speaker and then becomes muffled as they walk out into an adjacent hallway.
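Applying an impulse response to a sound is conventionally done by convolution, as in the sketch below; the decaying-noise response and the random source signal are synthetic placeholders used purely to show the rendering step, not the model’s actual output.

```python
# Rendering sketch: convolve a "dry" source sound with an impulse response to
# approximate what a listener at the corresponding position would hear.
# Both signals here are synthetic placeholders.
import numpy as np
from scipy.signal import fftconvolve

sr = 16000                                   # assumed sample rate in Hz
dry = np.random.randn(sr)                    # one second of placeholder source audio
t = np.arange(sr // 2) / sr
impulse_response = np.random.randn(t.size) * np.exp(-6.0 * t)  # toy decaying response

heard = fftconvolve(dry, impulse_response)   # audio as heard at the listener's spot
heard /= np.max(np.abs(heard))               # normalise to avoid clipping
print(heard.shape)                           # (len(dry) + len(impulse_response) - 1,)
```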

When they compared their technique with other methods that model acoustic information, the researchers found that it consistently generated more accurate sound models. And because it learned local geometric information, their model generalised to new locations in a scene far better than previous approaches. They also found that feeding the acoustic information their model learns into a computer-vision model can improve the visual reconstruction of the scene.

“When you only have a sparse set of views, for example, using these acoustic features enables you to capture boundaries more sharply. Perhaps this is because, to accurately render the acoustics of a scene, the model has to capture the underlying 3D geometry of that scene,” says Du.

The researchers plan to keep improving the model so it can be applied to new scenes. They also want to extend the technique to more complex impulse responses and larger scenes, such as entire buildings or even a whole town or city. “This new technique could open up new opportunities to create a multimodal immersive experience in the metaverse application,” adds Gan.

“My group has put a great deal of effort into modelling the acoustics of real-world scenes and accelerating acoustic simulation. This paper by Chuang Gan and his co-authors is unquestionably a major step forward in this direction,” says Dinesh Manocha, the Paul Chrisman Iribe Professor of Computer Science and Electrical and Computer Engineering at the University of Maryland, who was not involved in this work.

“In particular, this paper proposes a suitable implicit representation that describes sound using a linear time-invariant system, reflecting how sound propagates in real-world scenes. This work has many applications for both real-world scene understanding and AR/VR.”
