Prediction of Speaker Age from Voice
The estimation of a speaker's age from his or her voice has both forensic and commercial applications. Previous studies have shown that human listeners can estimate the age of a speaker to within 10 years on average, while recent machine age-estimation systems appear to show superior performance, with average errors as low as 6 years. However, the machine studies have used highly non-uniform test sets, for which knowledge of the age distribution gives the system a considerable advantage. In this study we compare human and machine performance on the same test data, chosen to be uniformly distributed in age. We show that in this case human and machine accuracy is more similar, with average errors of 9.8 and 8.6 years respectively, although if panels of listeners are consulted, human accuracy can be improved to a value closer to 7.5 years. Both humans and machines have difficulty in accurately predicting the ages of older speakers.
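As an illustration of the panel effect described above, the sketch below (Python, using synthetic data; the 12-year error spread is an assumption for demonstration, not a measurement from the study) shows how averaging the estimates of several listeners reduces the mean absolute error relative to a single listener.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic example: true ages drawn uniformly, mirroring the uniform test-set design,
    # and noisy listener estimates. The 12-year spread is an assumption for illustration.
    true_ages = rng.uniform(18, 80, size=200)
    n_listeners = 10
    estimates = true_ages + rng.normal(0, 12, size=(n_listeners, true_ages.size))

    mae_single = np.abs(estimates - true_ages).mean()              # typical individual listener
    mae_panel = np.abs(estimates.mean(axis=0) - true_ages).mean()  # panel mean as the estimate

    print(f"Mean absolute error, single listener: {mae_single:.1f} years")
    print(f"Mean absolute error, panel of {n_listeners}: {mae_panel:.1f} years")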
Prediction of Fatigue and Stress from Voice in Safety-Critical Environments (iVOICE)
The iVOICE project was a feasibility study funded by the European Space Agency under the Artes 20 programme. The project partners were UCL Speech, Hearing and Phonetic Sciences, the UCL Mullard Space Science Laboratory Centre for Space Medicine and the Gagarin Cosmonaut Training Centre (GCTC) in Star City, Russia. It ran from January 2014 to January 2015.
Performance-based measures of speech quality
Researchers: Mark Huckvale, Gaston Hilkhuysen. Funded by Research in Motion. Duration: 3 years (2010-2013).
This project seeks to design and test new methods for the evaluation of speech communication systems. The area of application is systems which operate at high levels of speech intelligibility, or which make little change to intelligibility (such as noise-reduction systems). Conventional intelligibility testing is not appropriate in these circumstances, and existing measures of speech quality are based on subjective opinion rather than on speech communication performance.
It is common for people to report requiring more "effort" to perceive noisy speech. If true, then the effectiveness of digital noise reduction (DNR) could be measured by the reduction in "listening effort" it provides: a "higher quality" system should provide a greater reduction in listening effort compared to a "lower quality" system.
Traditional evaluations of auditory communication technologies (such as DNR systems) have relied on intelligibility scores (which often fail to distinguish between systems) and speech quality ratings (which rely on listener opinion).
However, increased listening effort is associated with increased load on working memory which, in turn, can impact on the listener's memory and attention processes. This project therefore aims to establish novel objective performance measures that target these processes, in order to go beyond traditional intelligibility and speech quality scores and establish listening effort as an evaluation criterion for all auditory communication technologies.
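One way such a performance-based measure could work, shown here purely as a hypothetical sketch and not as the project's actual protocol, is to score a secondary task performed while listening: slower responses indicate greater listening effort, so a noise-reduction system can be credited with the effort it saves. The condition names and reaction times below are invented for illustration.

    from statistics import mean

    # Hypothetical dual-task data: reaction times (ms) on a secondary task performed
    # while listening, per condition. Slower responses are taken as a proxy for greater
    # listening effort. Condition names and numbers are invented for illustration.
    reaction_times_ms = {
        "clean speech":       [412, 430, 405, 441, 398],
        "noisy speech":       [521, 560, 498, 577, 540],
        "noisy speech + DNR": [468, 455, 490, 472, 461],
    }

    baseline = mean(reaction_times_ms["clean speech"])
    for condition, rts in reaction_times_ms.items():
        effort_cost = mean(rts) - baseline   # extra processing time relative to clean speech
        print(f"{condition:20s} mean RT {mean(rts):.0f} ms, effort cost {effort_cost:+.0f} ms")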
Centre for law enforcement audio research (CLEAR)
Researchers: Mark Huckvale, Gaston Hilkhuysen. Funded by the UK Home Office. Duration: 5 years (2007-2012).
The CLEAR project aims to create a centre of excellence in tools and techniques for the cleaning of poor-quality audio recordings of speech. The centre is initially funded by the U.K. Home Office for a period of five years and is run in collaboration with the Department of Electrical and Electronic Engineering at Imperial College.
Modelling speech prosody based on communicative function and articulatory dynamics
Researchers: Santitham Prom-On and Yi Xu. Funded by a Newton Fellowship to Santitham Prom-On (mentor: Yi Xu). Duration: 3 years (January 2011-December 2013).
Prosody is an important aspect of speech that contributes to its expressiveness and intelligibility. Quantitative modeling of speech prosody is key to the advancement of speech science and technology. Building on a previous successful research collaboration, the proposed research will be a major systematic effort to develop an “articulatory-functional” quantitative model of speech prosody and to integrate meaningful communicative functions into it.
SYNFACE: Synthesised talking face derived from speech for hearing disabled users of voice channels
The main purpose of the SYNFACE project is to increase the possibilities for hard-of-hearing people to communicate by telephone. Many people use lip-reading during conversations, and this is especially important for hard-of-hearing people. However, this clearly doesn't work over the telephone! This project aims to develop a talking face controlled by the incoming telephone speech signal. The talking face will facilitate speech understanding by providing lip-reading support. This method works with any telephone and is cost-effective compared with video telephony and text telephony, which need compatible equipment at both ends.
Dates: 2001-2004. Funded by: CEC Framework V. Duration: 3 years. Researchers: Andrew Faulkner, Ruth Campbell, Mark Huckvale
The size code in the expression of anger and joy in speech
To test the "size code" hypothesis for encoding anger and joy in speech. According to the hypothesis, these two emotions are conveyed in speech by exaggerating or understating the body size of the speaker, just as nonhuman animals exaggerate or understate their body size to communicate threat or appeasement. We will conduct acoustic analysis of publicly available emotional speech databases, synthesize Thai vowels with a 3D articulatory synthesizer using parameter manipulations suggested by the size code hypothesis, and ask Thai listeners to judge the body size and emotion of the speaker. Initial results support the size code hypothesis.
Dates: 2005. Funded by: Collaborative. Researchers: Yi Xu with Suthathip Chuenwattanapranithi, King Mongkut's University of Technology Thonburi, Thailand.
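The parameter manipulations suggested by the size code hypothesis amount to shifting the acoustic correlates of body size: lowering F0 and formant frequencies to signal a larger body (threat/anger) and raising them to signal a smaller body (appeasement/joy). The sketch below illustrates this idea with simple scale factors; the values and the source-filter style parameter set are assumptions for demonstration, not the settings used with the 3D articulatory synthesizer in the project.

    # Illustrative parameter manipulation in the spirit of the size code hypothesis:
    # a "larger body" percept is cued by lowering F0 and formants, a "smaller body"
    # percept by raising them. Scale factors are assumptions, not the project's values.
    neutral_vowel = {"f0_hz": 120.0, "formants_hz": [700.0, 1100.0, 2600.0]}

    def scale_vowel(vowel, f0_scale, formant_scale):
        """Return a copy of the vowel with F0 and formant frequencies rescaled."""
        return {
            "f0_hz": vowel["f0_hz"] * f0_scale,
            "formants_hz": [f * formant_scale for f in vowel["formants_hz"]],
        }

    anger_like = scale_vowel(neutral_vowel, f0_scale=0.8, formant_scale=0.9)  # sounds "larger"
    joy_like = scale_vowel(neutral_vowel, f0_scale=1.25, formant_scale=1.1)   # sounds "smaller"

    print("anger-like:", anger_like)
    print("joy-like:", joy_like)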
Spoken Language Conversion with Accent Morphing
Spoken language conversion is the challenge of using synthesis systems to generate utterances in the voice of a speaker but in a language unknown to that speaker. Previous approaches have been based on voice conversion and voice adaptation technologies applied to the output of a foreign-language TTS system. This inevitably reduces the quality and intelligibility of the output, since the source speaker is not a good source of phonetic material in the new language. In contrast, our approach uses two synthesis systems: one in the source speaker's voice, and one in the voice of a native speaker of the target language. Audio morphing technology is then exploited to correct the foreign accent of the source speaker while trying to maintain his or her identity. In this project we aim to construct a spoken language conversion system using accent morphing and to evaluate its performance in terms of intelligibility and speaker identity.
Dates: 2006-. Researchers: Mark Huckvale, Kayoko Yanagisawa
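The core of the approach can be pictured as frame-wise interpolation between two time-aligned renderings of the same sentence, one from the source speaker's synthesizer and one from a native-speaker synthesizer of the target language. The sketch below is a toy illustration of such spectral morphing, not the project's actual morphing algorithm; alignment, phase handling and resynthesis are omitted.

    import numpy as np

    def morph_spectra(source_frames, native_frames, alpha=0.7):
        """Frame-wise log-magnitude interpolation between two time-aligned spectrogram
        matrices (frames x frequency bins). alpha near 1 takes most spectral detail from
        the native-speaker rendering (reducing accent); alpha near 0 keeps the source
        speaker's spectra (preserving identity)."""
        assert source_frames.shape == native_frames.shape
        log_src = np.log(np.abs(source_frames) + 1e-8)
        log_nat = np.log(np.abs(native_frames) + 1e-8)
        return np.exp((1.0 - alpha) * log_src + alpha * log_nat)

    # Hypothetical usage with random matrices standing in for aligned spectrograms.
    src = np.random.rand(100, 257)   # 100 frames x 257 frequency bins
    nat = np.random.rand(100, 257)
    morphed = morph_spectra(src, nat, alpha=0.7)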
KLAIR - a virtual infant
The KLAIR project aims to build and develop a computational platform to assist research into the acquisition of spoken language. The main part of KLAIR is a sensori-motor server that supplies a client with a virtual infant on screen that can see, hear and speak. The client can monitor the audio-visual input to the server and can send articulatory gestures to the head for it to speak through an articulatory synthesizer. The client can also control the position of the head and the eyes, as well as setting facial expressions. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.
Dates: 2009. Researchers: Mark Huckvale
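A client interacting with KLAIR follows a simple sense-act loop: read the infant's current audio-visual input, decide on a response, and send back articulatory, gaze or facial-expression commands. The sketch below illustrates that loop only; the class and method names are invented for illustration and do not correspond to the actual KLAIR API.

    # Toy sketch of a KLAIR-style sense-act loop. All class and method names here are
    # invented and are NOT the actual KLAIR API; they only show the client-server
    # architecture described above.

    class ToyKlairClient:
        def connect(self, host="localhost", port=5000):
            print(f"(pretend) connecting to sensori-motor server at {host}:{port}")

        def receive_sensory_frame(self):
            # A real client would return the latest audio buffer and video frame here.
            return {"audio": b"", "video": None}

        def send_gesture(self, articulatory_params):
            # A real client would drive the articulatory synthesizer here.
            print("(pretend) sending gesture:", articulatory_params)

    client = ToyKlairClient()
    client.connect()
    for _ in range(3):  # a few iterations of the interaction loop
        frame = client.receive_sensory_frame()
        # A learning agent would choose a response based on the sensory frame here.
        client.send_gesture({"jaw": 0.2, "tongue_body": 0.5, "lip_rounding": 0.1})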
Quantitative modeling of tone and intonation
To develop a quantitative Target Approximation (qTA) model for simulating F0 contours of speech. Following the articulatory-functional framework of the PENTA model (Xu, 2005), the qTA model simulates the production of tone and intonation as a process of syllable-synchronized sequential target approximation. In the model, tone and intonation are treated as communicative functions that directly specify the parameters of the qTA model. The numerical values of the qTA parameters will be extracted from natural speech via supervised learning, and the quality of the modeling output will be both numerically assessed and perceptually evaluated.
Dates: 2005. Funded by: Collaborative. Researchers: Yi Xu with Santitham Prom-on and Bundit Thipakorn, King Mongkut's University of Technology Thonburi, Thailand.
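In the qTA formulation, the F0 contour of each syllable approaches a linear pitch target x(t) = mt + b through a third-order critically damped system, with the F0 value, velocity and acceleration at the end of one syllable carried over as the initial state of the next. The sketch below implements that equation directly; the parameter values in the usage example are illustrative assumptions, not values fitted from data.

    import numpy as np

    def qta_syllable(m, b, lam, f0_0, d1_0, d2_0, duration, fs=200):
        """Generate one syllable's F0 contour with the qTA equation
        f0(t) = (m*t + b) + (c1 + c2*t + c3*t**2) * exp(-lam*t),
        where m*t + b is the linear pitch target, lam is the rate of target
        approximation, and c1..c3 are set so that F0 and its first two derivatives
        are continuous with the final state (f0_0, d1_0, d2_0) of the previous
        syllable."""
        c1 = f0_0 - b
        c2 = d1_0 + c1 * lam - m
        c3 = (d2_0 + 2 * c2 * lam - c1 * lam ** 2) / 2
        t = np.arange(0, duration, 1.0 / fs)
        p = c1 + c2 * t + c3 * t ** 2
        decay = np.exp(-lam * t)
        f0 = (m * t + b) + p * decay
        d1 = m + (c2 + 2 * c3 * t) * decay - lam * p * decay                # velocity
        d2 = (2 * c3 - 2 * lam * (c2 + 2 * c3 * t) + lam ** 2 * p) * decay  # acceleration
        return f0, (f0[-1], d1[-1], d2[-1])

    # Illustrative two-syllable sequence: a static high target followed by a falling
    # target. Parameter values are assumptions for demonstration, not fitted values.
    f0_a, state = qta_syllable(m=0.0, b=110.0, lam=20.0,
                               f0_0=100.0, d1_0=0.0, d2_0=0.0, duration=0.20)
    f0_b, state = qta_syllable(m=-50.0, b=110.0, lam=20.0,
                               f0_0=state[0], d1_0=state[1], d2_0=state[2], duration=0.25)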
Role of sensory feedback in speech production as revealed by the effects of pitch- and amplitude-shifted auditory feedback
The overall goal of this research project is to understand the function of sensory feedback in the control of voice fundamental frequency (F0) and intensity through the technique of reflex testing. The specific aims of the project are: to determine whether pitch-shift and loudness-shift reflex magnitudes depend on the vocal task; to determine whether the direction of pitch-shift and loudness-shift reflexes depends on the reference used for error correction; and to investigate mechanisms of interaction between kinesthetic and auditory feedback in voice control. The overall hypothesis is that sensory feedback is modulated according to the specific vocal tasks in which subjects are engaged. By testing reflexes in different tasks, we will learn how sensory feedback is modulated in these tasks. We also hypothesize that auditory reflexes, like reflexes in other parts of the body, may reverse their direction depending on the vocal task. The mechanisms controlling such reflex reversals will be investigated, and this information will be important for understanding some voice disorders. It is also hypothesized that kinesthetic and auditory feedback interact in their control of the voice. Applying a temporary anesthetic to the vocal folds while simultaneously testing auditory reflexes will provide important information on the brain mechanisms that govern the interaction between these two sources of feedback.
Dates: 2004-2009. Funded by: Internal/NIH. Duration: 4 years. Researchers: Yi Xu with Charles Larson and colleagues, Northwestern University, USA.