Training Statistical Language Models
with Out-of-Domain Data

Currently, the most successful technique for language modeling in speech recognition seems to be the trigram model. However, a trigram language model generally requires a large amount of data to obtain robust statistics, and a trigram model trained on one type of corpus may not work well on a very different task. For example, on the Switchboard conversational speech data, a trigram language model trained on a very large corpus of North American Business News (NABN) gives far worse results (in both perplexity and recognition accuracy) than a trigram model trained only on the Switchboard corpus, which is an order of magnitude smaller than NABN. This result is not surprising given the drastically different nature of the two data sets: a formal writing style about business-related topics vs. a very informal and disfluent conversational style that tends toward stories of individual experiences. However, if speech recognition technology is to be easily ported to new tasks, the problem of estimating good language models from a small amount of data is a critical obstacle, and it is desirable to find some way of using other corpora.
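To make the perplexity comparison concrete, the sketch below trains a toy trigram model with add-one smoothing and scores both the training text and a held-out sentence. The tiny corpora are invented for illustration (stand-ins for in-domain and mismatched data), and add-one smoothing is a simple placeholder for the more sophisticated smoothing a real system would use.

```python
from collections import Counter
from math import log2

# Invented toy corpora, standing in for in-domain training and test data.
train = "we were talking about the game last night".split()
held_out = "we were talking about the news".split()

def trigrams(tokens):
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

tri_counts = Counter(trigrams(train))
bi_counts = Counter(t[:2] for t in trigrams(train))
V = len(set(train) | {"</s>"})  # vocabulary size used for smoothing

def trigram_prob(tri):
    # Add-one (Laplace) smoothing keeps unseen trigrams, common when the
    # test data comes from a different domain, at nonzero probability.
    return (tri_counts[tri] + 1) / (bi_counts[tri[:2]] + V)

def perplexity(tokens):
    tris = trigrams(tokens)
    avg_log_prob = sum(log2(trigram_prob(t)) for t in tris) / len(tris)
    return 2 ** (-avg_log_prob)

# Mismatched text scores worse (higher perplexity) than the training text.
print(perplexity(train), perplexity(held_out))
```

Even on this toy example, the held-out sentence containing an unseen word gets a higher perplexity than the training text, mirroring (at miniature scale) the in-domain vs. out-of-domain mismatch described above.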

In this work, we address the problem of designing and training statistical language models when only a relatively small amount of data is available in the domain of interest; in other words, the problem of ``portability'' of language modeling techniques. We focus on estimating language models for conversational speech, where data is less easily collected. In particular, we have concentrated on the Switchboard task, but have also applied some of the techniques to the Spanish Callhome data. There are several possible directions one might pursue for addressing this problem, including augmenting the small training set with data from other corpora and reducing the number of free parameters in the language model so that it can be trained with the data available. Our approaches rely on the use of multi-domain data.
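As a minimal illustration of one standard way to use out-of-domain data, the sketch below linearly interpolates an in-domain and an out-of-domain model. The corpora, the unigram order, and the mixture weight are all illustrative assumptions made here for brevity; the project's actual methods (relevance weighting of n-gram data, as in the publications below) are more elaborate.

```python
from collections import Counter

# Invented toy corpora: conversational (in-domain) vs. business-news style.
in_domain = "uh yeah i mean we went to the game".split()
out_domain = "the company reported strong quarterly earnings".split()

def unigram_model(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return lambda w: counts[w] / total

p_in = unigram_model(in_domain)
p_out = unigram_model(out_domain)

# Weight on the small in-domain model; in practice this would be tuned
# on held-out in-domain data.
lam = 0.8

def p_mix(w):
    return lam * p_in(w) + (1 - lam) * p_out(w)

# "earnings" is unseen in-domain but gains probability mass from the
# out-of-domain model under the mixture.
print(p_in("earnings"), p_mix("earnings"))
```

The design point of interpolation is that each component model is trained on homogeneous data and the mixture weight controls how much the (large, mismatched) out-of-domain estimates are trusted relative to the (small, matched) in-domain ones.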

In the process of this work, we found that perplexity is a poor predictor of recognition performance when out-of-domain data is used, with an even weaker correlation than has been reported in the past. As a consequence, we are exploring alternative performance prediction techniques.

(September 1995 -- September 1998)




The publications below were supported in whole or in part by this grant. Publications supported by a related ONR-ARPA grant are listed on the publications page.

``Using Out-of-Domain Data to Improve In-Domain Language Models,'' R. Iyer, M. Ostendorf and H. Gish, IEEE Signal Processing Letters, to appear August 1997.

``Transforming Out-of-Domain Estimates to Improve In-Domain Language Models,'' R. Iyer and M. Ostendorf, Proc. Eurospeech, vol. 4, pp. 1975-1978, 1997.

``Analyzing and Predicting Language Model Improvements,'' R. Iyer, M. Ostendorf, and M. Meteer, IEEE Workshop on Speech Recognition and Understanding Proceedings, (S. Furui, B.-H. Juang, & W. Chou, eds.) pp. 254-261, 1997.

``Relevance Weighting for Combining Multi-Domain Data for N-Gram Language Modeling,'' R. Iyer and M. Ostendorf, manuscript submitted to Computer Speech and Language.
