Until recently, in-vehicle speech recognition was a "driver-distraction safe" feature that nonetheless hurt customer satisfaction. However, modern digital assistants like Amazon Alexa, Google Assistant, and Apple's Siri are bringing improved recognition accuracy and natural language understanding to conversations with computers. As a result, computer speech interaction is on the rise everywhere, including within vehicles. In fact, given the adoption curves of digital assistant technology, it's not hard to imagine voice soon becoming the primary way people interact with their cars.

Two factors potentially stand in the way. One is the in-vehicle acoustic environment: road, wind, and other vehicle noises are challenging for human listeners, let alone for speech recognition algorithms. The other is the thoroughly human habit of talking over one another. As we become more dependent on our digital assistants, we'll need them to separate and understand simultaneous conversations just as people do, rather than forcing everyone in the vehicle to fall silent while one person issues a command to the car.

We know that recognition accuracy will continue to be a critical driver for customer acceptance of in-vehicle digital assistants. At Mitsubishi Electric, we’ve been continuing to refine our technology to address these problems. You may have read our blog about the cocktail party problem or seen our announcement about our simultaneous speech separation technology.

In our new speech separation system, we use deep-learning methods from the field of artificial intelligence to cluster multiple people's speech. This is similar to how a human can follow an alternating conversation between two or more speakers, a process aided by knowing each speaker's unique voice. Our algorithm effectively groups the speech of multiple people who may be talking over each other, regardless of the languages being spoken!
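For readers curious about the mechanics, here is a minimal illustrative sketch in Python of the clustering idea behind this kind of separation. It is not our production algorithm: the separate_speakers function is hypothetical, and a simple random projection stands in for the trained deep network that would map each time-frequency bin of the mixture to an embedding.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_speakers(mixture_spec, n_speakers=2, embed_dim=20, seed=0):
    """Hypothetical sketch: cluster time-frequency (T-F) bins of a mixture
    spectrogram into per-speaker masks, in the spirit of clustering-based
    speech separation.

    mixture_spec: (T, F) magnitude spectrogram of the mixed signal.
    Returns one binary T-F mask per speaker.
    """
    T, F = mixture_spec.shape
    rng = np.random.default_rng(seed)

    # Stand-in for a trained embedding network: project a log-magnitude
    # feature of each T-F bin into an embedding space.
    features = np.log1p(mixture_spec).reshape(-1, 1)      # (T*F, 1)
    projection = rng.standard_normal((1, embed_dim))
    embeddings = features @ projection                    # (T*F, embed_dim)

    # Cluster the embeddings; ideally each cluster collects the T-F bins
    # dominated by one speaker.
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)

    # Turn cluster assignments into binary masks over the spectrogram.
    return [(labels == k).reshape(T, F) for k in range(n_speakers)]

# Example with a dummy spectrogram; applying each mask to the complex
# mixture spectrogram and inverting the STFT would yield each speaker's
# estimated waveform.
spec = np.abs(np.random.randn(100, 257))   # 100 frames, 257 frequency bins
mask_a, mask_b = separate_speakers(spec)
print(mask_a.shape, mask_b.shape)          # (100, 257) (100, 257)
```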

We combine this speech processing with images of people's faces from camera feeds. Our system has been trained on thousands of hours of video, using machine learning to match vocal sounds with facial movements. Just as it's far easier to understand what a person is saying when you can see their face, our algorithm uses visual cues to pick out what each person is saying against competing speech and road noise. This is much like the lip reading we humans do to understand a friend's speech in noisy settings like a loud restaurant, something people do naturally without noticing but that computers have only recently learned to do.
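To make the fusion idea concrete, the sketch below combines an audio feature for one spectrogram frame with a visual feature from a face crop. Everything here is a placeholder introduced for illustration: audio_encoder, visual_encoder, and fuse stand in for trained networks, and in a real system the joint feature would feed a mask-prediction head rather than simply being printed.

```python
import numpy as np

def audio_encoder(spectrogram_frame):
    # Placeholder for a trained audio network: here, a simple
    # log-magnitude transform of one spectrogram frame.
    return np.log1p(spectrogram_frame)

def visual_encoder(face_crop):
    # Placeholder for a trained visual network that would encode lip
    # movements; here, a crude per-channel average of the crop.
    return face_crop.mean(axis=(0, 1))

def fuse(audio_feat, visual_feat):
    # Concatenate the modalities; a trained fusion network would map this
    # joint feature to a time-frequency mask for the on-screen speaker.
    return np.concatenate([audio_feat.ravel(), visual_feat.ravel()])

# Dummy inputs: one spectrogram frame and one RGB lip-region crop.
spec_frame = np.abs(np.random.randn(257))   # 257 frequency bins
face_crop = np.random.rand(64, 64, 3)       # 64x64 RGB crop

joint = fuse(audio_encoder(spec_frame), visual_encoder(face_crop))
print(joint.shape)   # (260,): 257 audio dims + 3 visual dims
```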

We know that autonomous cars will need even better in-car speech. Even though self-driving cars may have more display screens than today's cars, speech is a more natural communication mechanism than a graphical user interface, and we believe it will become the friendlier, more powerful channel for most in-car tasks. In addition, while today's manual driving keeps occupants in pilot and co-pilot roles, self-driving cars will turn vehicles into mobile entertainment and social centers, with passengers conversing and interacting. In this more boisterous audio environment, similar to a kitchen or living room, the need for the car to distinguish individual speakers and extract their speech accurately will be paramount.

In anticipation of this need, we're leveraging our expertise in speech processing to deliver cleanly separated, distinct audio streams for each speaker with most of the noise removed: perfect for multiple simultaneous speech dialogs among the people in the car. This technology will be demoed in our CES booth, alongside a host of other new mobility solutions. Sign up to get a private tour.

Jacek Spiewla, Sr. Manager, User Experience

Jacek holds a Master’s in Human-Computer Interaction from the University of Michigan, and has a deep background in speech/audio processing technology, as well as voice user interface design. He is responsible for strategic planning activities and coordinating UX projects.