Improving and Predicting Performance of Statistical Language Models in Sparse Domains

Rukmini M. Iyer

Standard statistical language models, or n-gram models, which represent the probability of word sequences, suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. This thesis focuses on improving the estimation of domain-dependent n-gram models by using out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this thesis introduces two approaches that compensate for multi-domain differences, both representing "style" by part-of-speech (POS) sequences and "content" by the particular choice of words. First, data from multiple domains is combined using similarity weighting schemes that discriminate for content and style relevance prior to pooling multi-domain text. Second, n-gram distributions from multiple domains are combined via a POS-dependent n-gram framework that separately compensates for word and POS usage differences. Two variations are explored: explicitly transforming the out-of-domain distribution before combining it with an in-domain model, and separately estimating components of the POS-dependent n-gram model using multi-domain data. Finally, measures to analyze and predict recognition performance of language models are also investigated, resulting in an algorithm for predicting performance differences associated with localized changes in language models given a recognition system.
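As a rough illustration of the general idea of pooling multi-domain text, the sketch below adds down-weighted out-of-domain bigram counts to in-domain counts before estimating probabilities. It is a simplified stand-in, not the thesis's similarity weighting scheme: the single global weight `out_weight`, the toy sentences, and all function names are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

def pool_counts(in_domain, out_domain, out_weight):
    """Add out-of-domain counts, scaled by out_weight, to in-domain counts."""
    pooled = Counter()
    for bigram, count in in_domain.items():
        pooled[bigram] += count
    for bigram, count in out_domain.items():
        pooled[bigram] += out_weight * count
    return pooled

def bigram_prob(pooled, prev, word):
    """Relative-frequency estimate of P(word | prev) from pooled counts."""
    total = sum(c for (p, _), c in pooled.items() if p == prev)
    return pooled[(prev, word)] / total if total else 0.0

# Toy data (hypothetical): conversational in-domain text, newswire-like out-of-domain text.
in_dom = bigram_counts("uh i think so yeah i think so".split())
out_dom = bigram_counts("analysts think the market will rise".split())

# Assumed global weight; the thesis instead weights by content and style relevance.
pooled = pool_counts(in_dom, out_dom, out_weight=0.5)
```

With a weight below 1, the out-of-domain text contributes word sequences the in-domain data never saw (here, "think the") while the in-domain counts still dominate the estimate.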

Experiments are mainly based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. However, portability of the techniques developed in this thesis is evaluated by additional experiments on a Spanish task. Both the data and distribution combination approaches lead to a 3-5% improvement in recognition performance over a domain-specific model, demonstrating larger gains than those obtained with previous approaches and the biggest gain from language modeling advances reported thus far on the Switchboard task. Furthermore, the new performance predictor demonstrates a 0.96 correlation with recognition performance compared to 0.83 for the existing perplexity measure, while providing a diagnostic of weaknesses of the language model under consideration. Results from this thesis impact the rapid development of new applications of speech and language technology, ranging from speech to handwriting recognition and from language transcription to understanding and translation.
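For reference, the perplexity measure mentioned above is conventionally computed as the exponential of the negative average log-probability that a model assigns to each word of a test text. The sketch below shows only this standard definition; it does not reproduce the thesis's new predictor, and the function name and input format are assumptions.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model probability assigned to each test word.

    word_probs: non-empty list of probabilities in (0, 1], one per test word.
    """
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

A model that assigns probability 0.25 to every test word has perplexity 4, which can be read as being as uncertain as a uniform choice among four words.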

The full thesis is available in PostScript format (811 kB).