Automatic Speech Recognition as a Model for Human Speech Perception
The goal of this project is to use an automatic speech recognizer (ASR) to simulate
and predict properties of human speech perception. This approach is based on
the more general methodology of using computational learning algorithms as a
model of human learning, a concept which has increasingly attracted attention
over the past decade. Although it is widely agreed that ASR is not an adequate
model of adult speech perception, it may be a reasonable approximation to the
perception of phonetic categories by infants, in whom certain factors, such
as higher-level contextual knowledge, do not play a significant role.
Our specific goal is to model the acquisition of phonetic distinctions
(phoneme categories) by infant listeners, using both 'motherese'
(overarticulated) speech data and adult speech data as training and test
material for a statistical speech recognizer. Results from experimental
studies on infant speech perception have shown that infants prefer extremely
clear, hyperarticulated ('motherese') speech to normal adult speech. In
motherese, formant space regions covered by different vowels are spread
further apart than in adult speech. It is likely that these extreme or
prototypical 'training samples' help infants to acquire phonetic categories
and to make appropriate generalizations when classifying normal adult
speech. If automatic speech recognition were a good model of human speech
perception, the same effect should be observable when an automatic speech
recognizer is trained on motherese speech samples and tested on adult
speech. It is well known in the speech recognition community that
mismatched training and test conditions decrease performance. However, it has
not yet been tested whether systematically mismatched training data in the
form of prototypical examples has a beneficial effect on classifying test data
in which the regions of different classes overlap more.
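The train-on-motherese, test-on-adult paradigm described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the project's method: the project used a statistical speech recognizer on real recordings, whereas this example substitutes a per-vowel diagonal Gaussian classifier on synthetic (F1, F2) formant values. The vowel means in `ADULT_MEANS`, the standard deviations in `STD`, the expansion factor 1.4, equal class priors, and the helper names (`spread_means`, `sample`, `train`, `accuracy`) are all hypothetical choices for illustration. "Motherese" is modelled simply as the adult vowel space stretched away from its centroid.

```python
import math
import random

# Hypothetical (F1, F2) vowel means in Hz for adult-directed speech.
# Illustrative values only, not measurements from this project.
ADULT_MEANS = {"i": (300.0, 2300.0), "a": (750.0, 1300.0), "u": (320.0, 900.0)}
STD = (90.0, 250.0)  # assumed per-formant standard deviation in Hz


def spread_means(means, factor):
    """Push each vowel mean away from the centroid of all means by `factor`,
    a crude stand-in for the expanded vowel space of hyperarticulated speech."""
    n = len(means)
    c1 = sum(m[0] for m in means.values()) / n
    c2 = sum(m[1] for m in means.values()) / n
    return {v: (c1 + factor * (f1 - c1), c2 + factor * (f2 - c2))
            for v, (f1, f2) in means.items()}


def sample(means, n_per_class, rng):
    """Draw (label, (F1, F2)) pairs from independent Gaussians per vowel."""
    return [(v, (rng.gauss(mu1, STD[0]), rng.gauss(mu2, STD[1])))
            for v, (mu1, mu2) in means.items() for _ in range(n_per_class)]


def train(data):
    """Fit a diagonal-covariance Gaussian (mean and variance per formant) per vowel."""
    models = {}
    for v in sorted({label for label, _ in data}):
        xs = [x for label, x in data if label == v]
        mu = [sum(x[d] for x in xs) / len(xs) for d in range(2)]
        var = [sum((x[d] - mu[d]) ** 2 for x in xs) / len(xs) for d in range(2)]
        models[v] = (mu, var)
    return models


def log_likelihood(x, mu, var):
    """Log density of x under a diagonal Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * var[d]) - (x[d] - mu[d]) ** 2 / (2 * var[d])
               for d in range(2))


def classify(models, x):
    """Pick the vowel with the highest likelihood (equal priors assumed)."""
    return max(models, key=lambda v: log_likelihood(x, *models[v]))


def accuracy(models, data):
    return sum(classify(models, x) == v for v, x in data) / len(data)


rng = random.Random(0)
adult_test = sample(ADULT_MEANS, 200, rng)
adult_model = train(sample(ADULT_MEANS, 200, rng))
motherese_model = train(sample(spread_means(ADULT_MEANS, 1.4), 200, rng))
acc_adult = accuracy(adult_model, adult_test)
acc_motherese = accuracy(motherese_model, adult_test)
print(f"trained on adult: {acc_adult:.3f}, trained on 'motherese': {acc_motherese:.3f}")
```

Comparing `acc_adult` (matched condition) with `acc_motherese` (systematically mismatched, prototypical training data) on the same adult test set mirrors the comparison the project proposes; varying the overlap via `STD` or the spread factor shows how the answer depends on the assumed distributions, which is why the question has to be settled empirically.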
The potential benefits of this project are twofold. First,
the suitability of ASR as a computational model
of phonetic learning will be tested. Second, the results from
this pilot project will be used to develop improved training and
adaptation methods for automatic speech recognition.
SPONSOR
Institute for Learning and Brain Sciences
PUBLICATIONS
- K. Kirchhoff and S. Schimmel, "Statistical modelling of infant-directed vs. adult-directed speech: insights from speech recognition", 146th Meeting of the Acoustical Society of America, Austin, Texas, 2003
- K. Kirchhoff and S. Schimmel, "Statistical properties of infant-directed vs. adult-directed speech: insights from speech recognition", Journal of the Acoustical Society of America, in press