Modeling Structure in Speech above the Segment for Spontaneous Speech Recognition

Current speech recognition technology, while useful in tightly constrained domains with cooperative speakers, still leads to unacceptably high error rates (30-50%) on unconstrained speech such as the conversational speech in the Switchboard corpus and the radio news broadcasts in the DARPA Hub4 task. An important difference between these tasks and the controlled speech tasks that give 90% accuracy or better is the much larger variability in speaking style, even within data from a single speaker. Since they do not account for the systematic factors behind this variability, current acoustic models must be "broader," leading to more confusability among words and hence high error rates. To address this problem, this work proposes to improve acoustic models by representing structure in speech above the level of the phonetic segment.

Specifically, the work involves modeling structure at three time scales: the syllable, short regions or sequences of words within an utterance, and the session or conversation (i.e. speaker).

At the syllable level, we are developing automatic clustering techniques for high-dimensional context vectors in order to capture the effects of syllable structure and longer contexts, as well as a parametric model of temporal variability. A goal is to move from phone to syllable-size units to facilitate modeling reduction phenomena where phone segments are dropped but the associated gestures are still apparent in the coarticulation effects on neighboring phones.

At the region level, the project involves incorporating a slowly varying hidden speaking mode that is an indicator of systematic differences in pronunciations associated with reduced vs. clearly articulated speech. The hidden speaking mode is cued by acoustic features, such as speaking rate, relative energy and relative pitch range, as well as conversation-level word cues to information status. Currently, we are investigating the dependence of word-based and acoustic variation on dialog acts, such as statements, questions and back-channel acknowledgements.

At the speaker level, the effort involves using hierarchical models of the correlation among speech sounds (e.g. between an "m" and an "n" for the same speaker) to improve adaptation of acoustic models. Such dependence models address a problem seen in many applications: the acoustic space is sparsely sampled because only a small amount of data is available for adaptation.

The three efforts have in common the general themes of developing new stochastic models that capture high-level structure, and using automatic learning to help define both the structure and appropriate regions of parameter tying for estimation. Automatic learning includes linguistically motivated feature extraction so the algorithms benefit as much as possible from existing knowledge.
All three efforts build on a common framework of non-stationary segment trajectory modeling, and the interactions between the different time scales will be leveraged in an integration of the three components in the final stage of the project.
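To make the hidden speaking mode idea concrete, the sketch below shows one minimal way such a latent variable could reweight pronunciation probabilities. This is an illustration under invented assumptions, not the project's actual model: the cue thresholds, the mode posterior, and all pronunciation probabilities are hypothetical.

```python
# Hypothetical sketch: a binary hidden speaking mode (clear vs. reduced)
# inferred from coarse acoustic cues, used to mix mode-conditioned
# pronunciation distributions. All numbers here are invented.

def mode_posterior(speaking_rate, rel_energy, fast_rate=6.0, low_energy=0.4):
    """Toy estimate of P(mode = reduced | cues); thresholds are made up."""
    score = 0.0
    if speaking_rate > fast_rate:   # fast speech (syllables/sec) cues reduction
        score += 1.0
    if rel_energy < low_energy:     # low relative energy also cues reduction
        score += 1.0
    return min(0.9, 0.1 + 0.4 * score)

def pronunciation_probs(word_prons, speaking_rate, rel_energy):
    """Mix mode-conditioned pronunciations:
       P(pron | word, cues) = sum_m P(pron | word, m) * P(m | cues)."""
    p_reduced = mode_posterior(speaking_rate, rel_energy)
    return {pron: (1.0 - p_reduced) * p_clear + p_reduced * p_red
            for pron, (p_clear, p_red) in word_prons.items()}

# "probably": full form vs. a reduced variant; probabilities illustrative.
prons = {"p r aa b ax b l iy": (0.9, 0.3), "p r aa l iy": (0.1, 0.7)}
fast = pronunciation_probs(prons, speaking_rate=7.5, rel_energy=0.3)
slow = pronunciation_probs(prons, speaking_rate=3.0, rel_energy=0.8)
```

In the fast, low-energy condition the reduced variant receives most of the probability mass, while in the slow, clear condition the full form dominates, which is the qualitative behavior the hidden mode is meant to capture.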

Experiments involve large vocabulary recognition of conversational speech. The effort relies on a multi-pass recognition search strategy, which reduces the search space with standard hidden Markov models in order to allow rescoring with the higher-order (and therefore more computationally expensive) models developed here. Thus, the work builds on strengths of existing speech recognition technology, while exploring radically new knowledge sources (i.e. long-term dependence), a combination that has the potential to significantly advance the state of the art in speech recognition performance. (April 1997 - March 2000)
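The multi-pass strategy above can be sketched as N-best rescoring: a fast first-pass HMM search produces a short hypothesis list, and a more expensive second-pass model re-ranks it. The sketch below is a generic illustration; the hypotheses, scores, interpolation weight, and second-pass scorer are all invented, not the project's actual components.

```python
# Hypothetical sketch of multi-pass rescoring: combine first-pass HMM log
# scores with a second-pass model's log scores and re-rank the N-best list.

def rescore_nbest(nbest, second_pass_score, weight=0.5):
    """Re-rank an N-best list by interpolating two log scores.

    nbest: list of (hypothesis, first_pass_log_score) pairs
    second_pass_score: callable mapping a hypothesis to a log score
    weight: interpolation weight given to the second-pass model
    """
    rescored = [(hyp, (1.0 - weight) * lp + weight * second_pass_score(hyp))
                for hyp, lp in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy example: the first pass slightly prefers the wrong hypothesis, and an
# (invented) second-pass model that favors "speech" flips the ranking.
nbest = [("recognize speech", -12.0), ("wreck a nice beach", -11.5)]
second = lambda hyp: 0.0 if "speech" in hyp else -5.0
ranked = rescore_nbest(nbest, second, weight=0.5)
```

The point of the design is that the expensive model only ever scores the handful of surviving hypotheses, so its cost is decoupled from the size of the full search space.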


AWARD PERIOD: April 1997 - March 2001


