Translation Technology for Language Modeling

Virtually all natural language systems that produce text -- from speech recognition to natural language generation -- rely on a language model as a core component to distinguish well-formed outputs from ill-formed ones, and to rank word strings according to their appropriateness for a given context. The low quality of these systems' outputs is most often blamed on the language model's inability to capture what "well-formedness" means. The difficulty of modeling well-formedness is due not only to the mathematical and algorithmic challenges specific to integrating multiple sources of knowledge, but also to the lack of robust, scalable tools and models for semantic, syntactic, and morphological processing.
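To make the ranking role of a language model concrete, here is a minimal sketch of a bigram model with add-one smoothing scoring candidate word strings; the toy corpus and candidate sentences are illustrative, not part of the project.

```python
from collections import Counter
import math

# Toy training corpus; a real system would train on millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat saw the dog",
]

# Collect unigram and bigram counts, padding each sentence with
# start/end markers so sentence boundaries are modeled too.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a word string."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    score = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, word)] + 1) /
                          (unigrams[prev] + vocab_size))
    return score

# Rank two candidate outputs: the model prefers the fluent word order
# because its bigrams were observed in training.
candidates = ["the cat sat on the mat", "mat the on sat cat the"]
ranked = sorted(candidates, key=log_prob, reverse=True)
```

Even this tiny model assigns a higher score to the grammatical word order, which is exactly the discrimination task the paragraph above describes.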

The overall goal of this project is to develop novel statistical and linguistic techniques that exploit the information available in parallel multilingual corpora (i.e., translations of the same source in multiple languages). Such corpora implicitly encode a hidden, common core that can be uncovered using state-of-the-art parameter estimation techniques. The research plan involves two main thrusts: i) automatic learning of structure in and across languages at multiple levels of abstraction: semantics, morphology, phonology, and paraphrasing, and ii) integration of the results into novel language model frameworks and estimation procedures to address the problem of limited domain- and language-specific training data. The hypothesis is that, by sharing data and structure across languages, and across genres within a language, the resulting models will be richer and more robust. Such ideas were impossible to envision using only a single language or pair of languages, but they are feasible now with the availability of multilingual corpora and significant increases in computing power.
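One standard way to address sparse domain- or language-specific training data, sketched here only as an illustration of the sharing idea (the project's actual frameworks are more sophisticated), is linear interpolation of a sparse in-domain model with a broader model estimated from pooled data. The probability values below are hypothetical.

```python
# Two toy conditional distributions P(word | "the"): a sparse in-domain
# model and a broader model pooled across genres/languages.
p_domain = {"cat": 0.6, "dog": 0.4}
p_shared = {"cat": 0.3, "dog": 0.3, "house": 0.4}

def interpolated_prob(word, lam=0.7):
    """Linear interpolation: lam * P_domain(w|h) + (1 - lam) * P_shared(w|h)."""
    return lam * p_domain.get(word, 0.0) + (1 - lam) * p_shared.get(word, 0.0)
```

Note that "house", unseen in the in-domain data, still receives nonzero probability from the pooled model: this is the robustness gain that sharing data across languages and genres is intended to provide.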

SPONSOR: NSF (IIS-0326276)

AWARD PERIOD: September 2003 - August 2007




