Multimodal speech and how to collect it

A lot happens during a conversation. Speech overlaps as participants take turns. Hand, body and eye movements help listeners to keep track. In a talk to the COG-MHEAR teams, Professor Naomi Harte of Trinity College Dublin described her work on speech recognition and recording. She explained that it is relatively easy for systems such as Alexa to understand one person speaking clearly. Once more than one person is speaking, the conversation becomes more difficult to comprehend. Machines also struggle with differing accents and speech impediments, and with understanding what is said when children or elderly people are talking. Being able to see someone’s mouth as they speak is often helpful. The lip movements for different sounds can look similar, so our brain combines the visual information from the lips with the sound from the speaker to fully resolve a message. The visual side is so powerful that if the lip movement does not seem to match the sound being made, our brains can make us think we are hearing a different sound from the one actually being spoken.

Naomi explained that good recordings of groups of people speaking are key to analysing speech, for both human and machine interaction. Multimodal recordings, in which video is carefully synchronised with sound, show the richness of a conversation, including the hesitations. It is important to get people to talk naturally as they are filmed. Games such as Family Feud, where contestants guess survey responses to questions, can help. Recordings of people chatting online as well as in person can be used to analyse these different types of conversation. Researchers look into details such as how speakers and listeners cue each other to take turns in the conversation, or how head nods and other gestures are used. There are cultural and individual variations in how this is done.

People automatically enhance their speech in whatever way they consider necessary to help the conversation along. This includes increasing lip movement or other visual cues where needed. Analysing speech can help improve the way in which humans and machines communicate.