Training Statistical Language Models
with Out-of-Domain Data
Currently, the most successful technique for language modeling in
speech recognition seems to be the trigram model. However, a trigram
language model generally requires a large amount of data in order to
obtain robust statistics, and a trigram model trained on one type of
corpus may not work well when applied to a very different task. For
example, on the Switchboard conversational speech data, a trigram
language model trained on a very large corpus of North American
Business News (NABN) gives far worse results (perplexity and
recognition accuracy) than a trigram trained only on the Switchboard
corpus, which is an order of magnitude smaller than NABN. This result
is not surprising given the drastically different nature of the two
data sets: a formal writing style about business-related topics vs. a
very informal and disfluent conversational style that tends to be more
oriented toward stories of individual experiences. However, if speech
recognition technology is to be easily used for new tasks, the problem
of estimating good language models from a small amount of data is a
critical obstacle to be addressed, and it is desirable to find some
way of using other corpora.
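To make the mismatch concrete, the sketch below trains a trigram model on a few words of formal news-style text and scores conversational text with it. This is only a toy illustration, not the experimental setup of this work: the two miniature corpora, the add-alpha smoothing, and the shared vocabulary are all illustrative assumptions.

```python
import math
from collections import Counter

def train_trigram(tokens):
    """Collect trigram counts and their bigram-context counts."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return tri, bi

def perplexity(tokens, tri, bi, vocab_size, alpha=1.0):
    """Perplexity of a token sequence under an add-alpha-smoothed trigram model."""
    log_prob = 0.0
    n = 0
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        p = (tri[(w1, w2, w3)] + alpha) / (bi[(w1, w2)] + alpha * vocab_size)
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

# Toy corpora: informal conversational style vs. formal news style.
conversational = "uh i i mean you know it was like really good you know".split()
news = "the company reported strong quarterly earnings growth on friday".split()

vocab = set(conversational) | set(news)
tri, bi = train_trigram(news)
# A model trained on news text assigns low probability (high perplexity)
# to conversational test data, mirroring the NABN/Switchboard mismatch.
print(perplexity(conversational, tri, bi, len(vocab)))
print(perplexity(news, tri, bi, len(vocab)))
```

With no vocabulary overlap between the two styles, every conversational trigram backs off to the smoothing floor, so the cross-domain perplexity is substantially higher than the in-domain value.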
In this work, we are addressing the problem of designing
and training statistical language models where a relatively
small amount of data is available in the domain of interest.
In other words, we address the problem of ``portability'' of
language modeling techniques. We focus on estimating
language models for conversational speech, where data is less
easily collected. In particular, we have concentrated on the
Switchboard task, but have also applied some of the techniques
to the Spanish Callhome data.
There are several possible directions that one
might pursue for addressing this problem, including augmenting the
small training set with data from other corpora and reducing the number
of free parameters in the language model so that it can be trained with
the data available. Our approaches rely on the use of multi-domain
data, with the following main thrusts:
- Finding distance metrics that quantify similarity of
corpora or substrings of corpora, to facilitate selective
use of other corpora in weighted data combination;
- Incorporating data from other corpora separately for word
and part-of-speech components in a model that uses part-of-speech
classes for smoothing rather than simplification; and
- Investigating the trade-offs of incorporating out-of-domain
data in a single vs. mixture language model framework.
In the process of this work, we found that perplexity is a poor
predictor of recognition performance when out-of-domain data is
used, even worse than has been reported in the past. As a consequence,
we are exploring alternative performance prediction techniques.
(September 1995 -- September 1998)
SPONSOR: BBN Inc.
- PI: Prof. Mari Ostendorf
- Graduate students:
Rukmini Iyer, Ph.D. 1998
Yuliya Lobacheva, M.S. candidate
- Undergraduate students:
Karen Gastaldo, B.A. candidate
Claudia Revueltas, B.S. candidate
The publications below were supported in whole or in part by this
grant. Publications supported by a related ONR-ARPA
grant are listed on that project's page.
``Using Out-of-Domain Data to Improve In-Domain Language Models,''
R. Iyer, M. Ostendorf and H. Gish, IEEE
Signal Processing Letters, to appear August 1997.
``Transforming Out-of-Domain Estimates to Improve In-Domain Language
Models,'' R. Iyer and M. Ostendorf, Proc. Eurospeech, vol. 4,
pp. 1975-1978, 1997.
``Analyzing and Predicting Language Model Improvements,'' R. Iyer,
M. Ostendorf, and M. Meteer, IEEE Workshop on
Speech Recognition and Understanding Proceedings, (S. Furui, B.-H. Juang,
& W. Chou, eds.) pp. 254-261, 1997.
``Relevance Weighting for Combining Multi-Domain Data for
N-Gram Language Modeling,'' R. Iyer and M. Ostendorf,
manuscript submitted to Computer Speech and Language.