Enabling Immersive Sound with OZO Audio

Over the years, Nokia has introduced several audio technologies that have transformed the way we use our mobile devices. Our research and development has been driven by consumer needs and emphasizes ease of use, reliability, and quality.

In consumer applications, high-quality audio capture relies on automation. Without it, recording features would be too difficult to use and would complicate the design of seamless, user-friendly interfaces. For example, in a typical mobile video application, many features are automated so that the user can concentrate on the creative aspects of recording video instead of on the technology. The user does not need to worry about audio at all, because everything is fully automated.

Solving challenges in automating high-quality audio capture

After many years of work and close technical collaboration with microphone manufacturers, Nokia was able to deliver high dynamic range (HDR) audio recording using MEMS microphone components that can record high sound pressure levels without distortion. Our goal was to introduce a microphone that could record audio at a sound pressure level of 140 dB, which is still a very demanding target for microphone manufacturers.
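To put the 140 dB target in perspective, the short sketch below converts sound pressure level to RMS pressure using the standard definition of dB SPL (reference pressure of 20 micropascals in air). The function names are illustrative and are not part of any Nokia API.

```python
import math

# Standard reference pressure for sound pressure level in air: 20 micropascals.
P_REF = 20e-6


def spl_to_pressure(spl_db: float) -> float:
    """Convert a level in dB SPL to RMS sound pressure in pascals."""
    return P_REF * 10 ** (spl_db / 20)


def pressure_to_spl(p_rms: float) -> float:
    """Convert RMS sound pressure in pascals to a level in dB SPL."""
    return 20 * math.log10(p_rms / P_REF)


if __name__ == "__main__":
    # 94 dB SPL is roughly 1 Pa; 140 dB SPL is roughly 200 Pa.
    for level in (94, 120, 140):
        print(f"{level} dB SPL is about {spl_to_pressure(level):.2f} Pa")
```

Every additional 20 dB multiplies the pressure by ten, so a microphone that stays undistorted at 140 dB has to handle about ten times the pressure of one that is rated for 120 dB.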

HDR recording enabled the introduction of a technology (Nokia Rich Recording) for high-quality recording of challenging sound environments, such as very loud rock concerts or fireworks, without distortion. With Nokia Rich Recording, the automated recording system became more reliable and robust without limiting content creation in difficult acoustic environments.

While modern studio microphones have a better signal-to-noise ratio than smartphone microphones, the MEMS microphone industry has made strides in bridging the gap between mass-produced consumer electronics components and high-performance studio microphones, whose transducer area can be 1000 times larger than that of consumer components.

An additional difficulty in automating the capture of high-quality audio outside a studio environment is how to define noise. In theory, noise is an unwanted signal or a disturbance. In practice, however, it’s not always easy to determine what signal components are unwanted. For example, two different video recordings—one, say, at Niagara Falls and the other in an industrial or urban environment—can have similar acoustic characteristics, yet the subjective interpretation of how unwanted the background ambience is can vary significantly. Depending on the context of the recording, an acoustic ambience can be either the essence of an immersive audio experience or a distraction. Therefore, an audio recording system that is designed to capture the full details of a spatial audio scene still needs some input from the user about the target of the captured content.

With Nokia OZO Audio technology, the user can choose whether to capture the entire 3D sound scene or to emphasize sound sources from a specific direction while attenuating sound from other directions. Audio Focus, the functionality that allows users to select which direction to emphasize, is available in OZO Audio–enabled devices with three or more microphones.
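The article does not describe how Audio Focus works internally. As a rough illustration of the general idea of emphasizing one direction with several microphones, the sketch below uses a textbook delay-and-sum beamformer, which is an assumption chosen for simplicity and not the OZO Audio algorithm. The microphone layout, test tone, and all function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(mic_x, angle_deg, sample_rate):
    """Arrival delay (in samples) at each microphone for a far-field source.

    mic_x: microphone positions along one axis, in metres (hypothetical layout).
    angle_deg: source direction measured from broadside (0 = straight ahead).
    """
    s = math.sin(math.radians(angle_deg))
    return [x * s / SPEED_OF_SOUND * sample_rate for x in mic_x]


def simulate(mic_x, source_angle_deg, freq, sample_rate, n):
    """Generate the sine tone each microphone would record from one source."""
    delays = steering_delays(mic_x, source_angle_deg, sample_rate)
    return [
        [math.sin(2 * math.pi * freq * (i - d) / sample_rate) for i in range(n)]
        for d in delays
    ]


def delay_and_sum(channels, delays):
    """Align the channels using whole-sample delays, then average them."""
    shifts = [round(d) for d in delays]
    base = min(shifts)
    shifts = [s - base for s in shifts]
    n = min(len(ch) - s for ch, s in zip(channels, shifts))
    return [
        sum(ch[i + s] for ch, s in zip(channels, shifts)) / len(channels)
        for i in range(n)
    ]


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


SAMPLE_RATE = 48_000
MIC_X = [0.0, 0.05, 0.10, 0.15]  # four microphones, 5 cm apart (assumed)
TONE_HZ = 1715.0                 # wavelength of about 20 cm in air

# Focus on a source at 90 degrees, then compare two source positions.
focus = steering_delays(MIC_X, 90.0, SAMPLE_RATE)
on_target = delay_and_sum(simulate(MIC_X, 90.0, TONE_HZ, SAMPLE_RATE, 4800), focus)
off_target = delay_and_sum(simulate(MIC_X, 0.0, TONE_HZ, SAMPLE_RATE, 4800), focus)

print(f"on-target RMS:  {rms(on_target):.3f}")
print(f"off-target RMS: {rms(off_target):.3f}")
```

With this particular spacing and tone, the source in the focus direction keeps nearly its full level while the source at 0 degrees is strongly attenuated. A real product would have to handle broadband signals, fractional delays, and the acoustics of the device body, none of which this sketch attempts.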

Implementing OZO Audio in consumer devices

Several factors come into play when adding spatial audio capture to a device, such as its form factor and the number and placement of its microphones. We have experience with manufacturing consumer and professional products with very different form factors and have been able to prove the flexibility of our industry-leading spatial audio technologies, which can be integrated into a wide range of devices.

Spatial audio capture for smartphone products typically involves cost optimization, which favors an advanced algorithmic solution for delivering great audio performance with a minimal number of microphones. OZO Audio supports spatial audio capture with as few as two microphones, while 360° capture is available for devices with at least three microphones. Full 3D capture is available for devices with four or more microphones.
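The microphone-count thresholds stated in this article can be summarized as a small lookup. This is only a restatement of the text in code form. The function name and the mode labels are hypothetical and do not come from an OZO Audio SDK.

```python
from typing import List


def capture_modes(num_microphones: int) -> List[str]:
    """Return the capture features the article lists for a microphone count.

    The labels are illustrative names, not official identifiers.
    """
    modes: List[str] = []
    if num_microphones >= 2:
        modes.append("spatial")       # spatial capture: two or more microphones
    if num_microphones >= 3:
        modes.append("360")           # 360-degree capture: three or more
        modes.append("audio_focus")   # Audio Focus: three or more
    if num_microphones >= 4:
        modes.append("full_3d")       # full 3D capture: four or more
    return modes


if __name__ == "__main__":
    for count in range(1, 6):
        print(count, capture_modes(count))
```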

Product implementations typically entail a close collaboration with the customer, where Nokia audio experts provide support and guidance to capture the best possible spatial audio. OZO Audio algorithms are optimized based on the customer’s device design, target use cases, and number and placement of microphones. Our customers can also take advantage of our world-class, certified acoustic laboratory facilities and performance evaluation systems.

Due to the flexibility of OZO Audio technologies, manufacturers can use the same technical implementation across their product portfolio. This enables interoperable content creation, sharing, and consumption experiences for users.

Bringing spatial audio to mobile devices

During the development of OZO Audio, we learned that none of the existing audio formats were a good choice for enabling spatial audio capture in a mobile device.

Traditional surround-sound systems use channel-based audio formats, such as 5.1, to support audio playback in home-theater environments. The spatial resolution of channel-based formats can be improved by increasing the number of loudspeaker channels, but this is a wasteful way of increasing spatial resolution, especially when most of the content is consumed through headphones. In addition, traditional channel-based audio formats with asymmetric loudspeaker configurations are not ideal for virtual reality (VR) content, where audio formats need to support head-tracking.

The popularity of 360° video content and VR content increased the interest in Ambisonics-based content formats, because Ambisonics supports an easy way to pan 3D audio, which is needed in VR headsets. The spatial resolution of Ambisonics-based systems can be improved by increasing the order of the Ambisonics format from First-Order Ambisonics (FOA) to Higher-Order Ambisonics (HOA), which requires dedicated compression technologies, such as MPEG-H, for data-efficient content delivery. Unfortunately, high-quality Ambisonics microphones typically have tens of microphone capsules arranged in a spherical array. Smartphones have very different design requirements: they’re built as thin as possible, have large displays on one side, and can only have a few microphones.
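Two properties of Ambisonics mentioned above can be sketched in a few lines: a full-sphere signal of order N needs (N + 1)^2 channels, which is why higher orders call for dedicated compression, and a first-order sound field can be turned around the vertical axis with a plain 2D rotation, which is what makes head-tracking straightforward. The sketch assumes a simplified first-order frame (W, X, Y, Z) with X pointing forward, Y pointing left, and unit gains. Real formats differ in channel order and normalization, and the function names are illustrative.

```python
import math


def ambisonic_channels(order: int) -> int:
    """Number of channels in a full-sphere Ambisonics signal of a given order."""
    return (order + 1) ** 2


def encode_horizontal(sample: float, azimuth_deg: float):
    """Encode one sample of a source on the horizontal plane into (W, X, Y, Z).

    Azimuth is counterclockwise from the front, so 90 degrees is to the left.
    Unit gains are an assumption made for illustration.
    """
    a = math.radians(azimuth_deg)
    return (sample, sample * math.cos(a), sample * math.sin(a), 0.0)


def rotate_yaw(frame, angle_deg: float):
    """Rotate a first-order frame counterclockwise around the vertical axis."""
    w, x, y, z = frame
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    # W (omnidirectional) and Z (vertical) are unaffected by a yaw rotation.
    return (w, x * c - y * s, x * s + y * c, z)


def compensate_head_yaw(frame, head_yaw_deg: float):
    """Counter-rotate the sound field so sources stay fixed as the head turns."""
    return rotate_yaw(frame, -head_yaw_deg)


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(f"order {order}: {ambisonic_channels(order)} channels")

    # A source on the listener's left; the listener then turns 90 degrees left,
    # so the source should end up straight ahead (X near 1, Y near 0).
    frame = encode_horizontal(1.0, 90.0)
    print(compensate_head_yaw(frame, 90.0))
```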

That’s why we looked for alternative ways to preserve spatial resolution and avoid unnecessary complexity and implementation overhead for mobile devices, where the primary content consumption is driven by headphone listening (with or without head-tracking). Capturing 3D audio on a mobile device is only useful if the device can also play it back, and we wanted to address that as well.

To deliver a mobile-friendly spatial audio solution, OZO Audio uses the most widely supported digital audio format: AAC stored in an MP4 container. OZO Audio enables 3D audio content sharing for headphone listening for practically all media devices that can play digital stereo audio. In addition, we introduced a standard-compliant OZO Audio metadata extension to support VR audio content playback (head-tracking) and alternative encoding of high-quality spatial audio content to other multi-channel or Ambisonics-based audio formats.

Thanks to the rapid development of mobile technologies, many people are carrying a smartphone with 4K video-recording capability. A perfect companion for 4K video recording is advanced spatial audio capture technology, which brings a new level of immersion to high-quality media content. With spatial audio capture, users can create new media experiences and achieve a sense of spatial presence that is typically missing from the content people share through channels such as social media.

As a leader in developing audio technologies, Nokia brings immersive audio to consumers everywhere, connecting people through accurate and natural-sounding spatial audio experiences.