- Published on
Hearing better in a noisy cocktail party environment with lip-reading
- Author
Wenwu Wang
Professor in Signal Processing and Machine Learning
Prof Wenwu Wang came to Edinburgh recently to discuss his work on speech source separation with the COG-MHEAR teams. Speech source separation is often described as the answer to the cocktail party problem: when many people talk at once, it is hard to hear the speech of any one person. Wenwu finds ways to isolate individual voices from mixed sound. He started by working with audio alone, then showed how an image of someone's lips as they speak can help produce a clearer recording of the one talker you want to hear. Features such as the changing width and height of a talker's lips are useful here. He then discussed how to build models that characterise the relation between the audio and video signals, and how to incorporate those models to improve the quality of the speech extracted for the target talker, including statistical models, sparse representation models, and deep learning models.
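To make the lip-feature idea concrete, here is a minimal sketch of how the changing width and height of a talker's lips might be tracked and compared against the audio envelope to check whether a video matches the voice being heard. The function names and the assumption that per-frame lip landmark coordinates are already available are illustrative, not code from the talk.

```python
import numpy as np

def lip_width_height(landmarks):
    """Per-frame lip width and height from four mouth landmarks.

    landmarks: array of shape (n_frames, 4, 2) holding the
    [left corner, right corner, top midpoint, bottom midpoint]
    (x, y) coordinates for each video frame (an assumed layout).
    """
    left, right, top, bottom = (landmarks[:, i, :] for i in range(4))
    width = np.linalg.norm(right - left, axis=1)
    height = np.linalg.norm(bottom - top, axis=1)
    return width, height

def audio_visual_correlation(width, height, audio_envelope):
    """Correlate the lip-opening area with a frame-rate audio envelope.

    A high score suggests the lips on screen move in step with the
    voice in the recording, i.e. this is likely the target talker.
    """
    area = width * height
    area = (area - area.mean()) / (area.std() + 1e-8)
    env = (audio_envelope - audio_envelope.mean()) / (audio_envelope.std() + 1e-8)
    return float(np.mean(area * env))

# Toy usage with random stand-in data: 100 video frames.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(100, 4, 2))
envelope = rng.random(100)
w, h = lip_width_height(landmarks)
print(audio_visual_correlation(w, h, envelope))
```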
And can this be done in real time? Or at least within the roughly 20 millisecond delay above which you start to notice a lag between the movement of a talker's mouth and the sound reaching your brain? And can it be done automatically, by robots?
Wenwu went on to explain how one person's speech can be extracted from several microphone signals at once. This can be achieved, for example, by convolutive independent component analysis with visual guidance to correct permutation misalignments in the separated speech components, or by time-frequency masking with audio-visual dictionary learning, where the time-frequency mask obtained from the audio data is regularised by a mask estimated from the video signals.
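The following is a hedged sketch of the masking idea: a soft time-frequency mask estimated from the audio is blended with a coarser, per-frame mask predicted from the video, so that time-frequency points active while the target's lips are moving are favoured. The placeholder masks, the convex blending rule, and the `lam` weight are illustrative assumptions, not the specific models described in the talk.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(1)
mixture = rng.normal(size=fs)  # stand-in for a 1-second mixed recording

# Short-time Fourier transform of the mixture.
f, t, X = stft(mixture, fs=fs, nperseg=512)

# Audio-derived soft mask in [0, 1] (placeholder: normalised magnitude;
# in practice this would come from a statistical or learned model).
mag = np.abs(X)
audio_mask = mag / (mag.max() + 1e-8)

# Video-derived mask: one value per time frame (e.g. lips open vs closed),
# broadcast across all frequency bins. Random placeholder here.
video_mask = np.broadcast_to(rng.random(len(t)), X.shape)

# Regularise the audio mask with the video mask: lam = 0 keeps the pure
# audio mask, lam = 1 trusts the video entirely.
lam = 0.3
mask = (1 - lam) * audio_mask + lam * video_mask

# Apply the mask and resynthesise an estimate of the target speech.
_, target_estimate = istft(mask * X, fs=fs, nperseg=512)
print(target_estimate.shape)
```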
It helps to know more about how humans understand sound: how our brains separate one voice from the surrounding noise. To a robot these are all just noises. Programming robots to work out what is speech and what is background noise, and then teaching them to recognise individual talkers, is key to getting a clear audio signal so that you can hear what is being said.