
UCL Psychology and Language Sciences


Hearing lips and hands, smiles and print too: How listening to words in the wild is not all that auditory to the brain

Abstract


Most research on the organization of language and the brain discards “context” in favor of isolated speech sounds or words. I describe an alternative model in which context is central rather than secondary to comprehension. Under this model, the brain uses available context to predict forthcoming sounds or words, reducing the need to process incoming auditory information. This predictive mechanism is implemented in simultaneously active networks, each extracting information from a different type of context. These networks are weighted by the informativeness of the contextual information and by listeners’ prior experience with that information. I provide evidence for this model by demonstrating that speech-associated mouth movements, co-speech gestures, and valenced facial expressions are all used by the brain to generate predictions of the sounds or words associated with those forms of context, resulting in a dramatic reduction of processing in auditory areas. I show that mouth movements and gestures are processed in spatially separable networks and that the weighting of each network is determined by how much predictive information an observed movement carries. Finally, I show that if listeners’ prior experience with heard speech comes through reading, auditory language comprehension relies largely on visual networks. In summary, real-world language comprehension is not always strongly reliant on auditory brain areas, because (visual) contextual information can be used to supplant processing in those areas. This implies that the de-contextualized picture of language and the brain that has emerged from studying isolated sounds or words is incomplete, if not misleading.