Improving and Predicting Performance of Statistical
Language Models in Sparse Domains
Rukmini M. Iyer
Standard statistical language models, or n-gram models,
which represent the probability of word sequences, suffer from
sparse-data problems in tasks where large amounts of domain-specific
text are not available. This thesis focuses on improving the
estimation of domain-dependent n-gram models by using out-of-domain
text data. Previous approaches for estimating language models from
multi-domain data have not accounted for the characteristic variations
of style and content across domains. In contrast, this thesis
introduces two approaches that compensate for multi-domain
differences, both representing "style" by part-of-speech (POS)
sequences and "content" by the particular choice of words. First,
data from multiple domains is combined using similarity weighting
schemes that discriminate for content and style relevance prior to
pooling multi-domain text. Second, n-gram distributions from
multiple domains are combined via a POS-dependent n-gram framework
that separately compensates for word and POS usage differences. Two
variations are explored: explicitly transforming the out-of-domain
distribution before combining with an in-domain model, and separately
estimating components of the POS-dependent n-gram model using
multi-domain data. Finally, measures to analyze and predict
recognition performance of language models are also investigated,
resulting in an algorithm for predicting performance differences
associated with localized changes in language models given a
recognition system.
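
As a rough illustration of the first approach (similarity-weighted pooling of multi-domain text), the sketch below accumulates fractional bigram counts, letting each out-of-domain sentence contribute in proportion to a relevance weight. It is a minimal sketch under stated assumptions, not the weighting scheme developed in the thesis: the function names, the per-sentence weights in [0, 1], and the unsmoothed maximum-likelihood estimate are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def weighted_bigram_counts(sentences, weights):
    """Accumulate fractional bigram counts; each sentence adds its own weight."""
    counts = defaultdict(float)
    for words, weight in zip(sentences, weights):
        if weight <= 0.0:
            continue  # a sentence judged irrelevant contributes nothing
        padded = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[(prev, cur)] += weight
    return counts


def pooled_bigram_model(in_domain, out_of_domain, out_weights):
    """Pool in-domain text (weight 1.0) with weighted out-of-domain text.

    The weights are assumed to come from some content/style similarity
    measure computed elsewhere; this sketch does not compute them.
    Returns unsmoothed relative-frequency estimates P(cur | prev).
    """
    counts = weighted_bigram_counts(in_domain, [1.0] * len(in_domain))
    extra = weighted_bigram_counts(out_of_domain, out_weights)
    for bigram, count in extra.items():
        counts[bigram] += count

    # Normalize by the total weighted count of each history word.
    history_totals = defaultdict(float)
    for (prev, _), count in counts.items():
        history_totals[prev] += count
    return {bg: count / history_totals[bg[0]] for bg, count in counts.items()}


if __name__ == "__main__":
    model = pooled_bigram_model(
        in_domain=[["yeah", "right"]],
        out_of_domain=[["yeah", "sure"]],
        out_weights=[0.5],  # hypothetical similarity score
    )
    print(model[("yeah", "right")], model[("yeah", "sure")])
```

A practical model would add smoothing or back-off on top of the pooled counts; the point here is only that down-weighted out-of-domain data shifts the estimate less than in-domain data does.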
Experiments are mainly based on the Switchboard corpus of spontaneous
conversations, with out-of-domain text drawn from the Wall Street
Journal and Broadcast News corpora. However, portability of
the techniques developed in this thesis is evaluated by additional
experiments on a Spanish task. Both the data and distribution
combination approaches lead to a 3-5% improvement in recognition
performance over a domain-specific model, demonstrating larger gains
than those obtained with previous approaches and the biggest gain from
language modeling advances reported thus far on the Switchboard
task. Furthermore, the new performance predictor demonstrates a 0.96
correlation with recognition performance compared to 0.83 for the
existing perplexity measure, while providing a diagnostic of
weaknesses of the language model under consideration. Results from
this thesis impact the rapid development of new applications of speech
and language technology, ranging from speech to handwriting
recognition and from language transcription to understanding and
translation.
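
For reference, the perplexity measure used above as the baseline predictor is the exponentiated negative average log-probability that a model assigns to the words of a test text. The sketch below shows only this standard definition; it does not implement the new performance predictor from the thesis, and the function name and input format (a list of natural-log word probabilities) are assumptions made for illustration.

```python
import math


def perplexity(log_probs):
    """Perplexity of a test text from per-word natural-log probabilities.

    log_probs[i] is assumed to be ln P(w_i | history) under the language
    model being evaluated. Lower perplexity means the model finds the
    text less surprising.
    """
    if not log_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(log_probs) / len(log_probs))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every word has perplexity 4.
    print(perplexity([math.log(0.25)] * 4))
```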
The full thesis in postscript format. (811 kB)