Learning Local Lexical Structure in Spontaneous Speech Language Modeling

Man-Hung Siu

Although significant progress has been made in recent years on automatic speech recognition of read speech, state-of-the-art performance on the more difficult task of transcribing unconstrained conversational speech is still at a 40% word error rate. One important problem in speech recognition is representing the prior probability of a word sequence, known as language modeling. Most recognition systems use n-gram language models. An n-gram model assumes that a word depends only on the previous n-1 words. However, this assumption fails to capture local structure, especially the structure observed in conversational speech, and the model requires the estimation of a large number of parameters.
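In standard notation (added here for illustration; not part of the original abstract), the n-gram assumption approximates the probability of a word sequence as an order-(n-1) Markov chain:

```latex
% The n-gram approximation of a word sequence w_1 ... w_T:
P(w_1, \ldots, w_T) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
% A full table of these conditional probabilities over a vocabulary V has on
% the order of |V|^n entries, which is why the parameter count grows so quickly.
```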

The goal of this thesis is to improve the n-gram model by learning local lexical structure. We focus on capturing three types of local lexical structure. First, we model words with an expanded vocabulary to account for the observation that a word can serve different communicative functions, such as different parts of speech. Second, we extend a variable n-gram learning algorithm to allow both skips and word equivalence classes in the word history. Skips are motivated by the occurrence of disfluencies in conversational speech, such as pause fillers and repetitions. The combination of variable n-gram histories and classes allows an extended maximum history length while reducing the number of parameters. Third, we develop algorithms to learn multi-word lexical units (e.g., "you know") using a special form of the variable n-gram learning algorithm; these units can be modeled either deterministically or non-deterministically (a sketch of the deterministic case follows below). We evaluate our models on the number of free parameters, test-set perplexity (an entropy-based measure of how difficult the model finds held-out text), and recognition word error rate.
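As a rough illustration of the deterministic case (a minimal sketch, not the thesis's learning algorithm; the unit inventory and function name here are hypothetical), multi-word lexical units can be applied by merging matching word sequences into single tokens before n-gram training:

```python
# Minimal sketch of deterministic multi-word lexical units: given a fixed
# inventory of units (hand-picked here; the thesis learns them from data),
# greedily merge matching word sequences into single tokens.

UNITS = {("you", "know"), ("i", "mean"), ("a", "lot", "of")}  # hypothetical inventory
MAX_LEN = max(len(u) for u in UNITS)

def merge_units(words):
    """Replace the longest multi-word unit matching at each position."""
    merged, i = [], 0
    while i < len(words):
        # Try the longest possible unit first, down to length 2.
        for span in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + span]) in UNITS:
                merged.append("_".join(words[i:i + span]))  # e.g. "you_know"
                i += span
                break
        else:
            merged.append(words[i])
            i += 1
    return merged

print(merge_units("you know i saw a lot of people yesterday".split()))
# -> ['you_know', 'i', 'saw', 'a_lot_of', 'people', 'yesterday']
```

An n-gram model trained on the merged stream treats each unit as a single token, so a trigram over merged tokens can effectively span more than three surface words.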

We show that by learning the local lexical structure of the language, we can reduce the number of parameters needed by more than 40% while at the same time reducing the test-set perplexity by 8% and the recognition word error rate by 1%.

The full thesis is available in PostScript format (2.00 MB).

