Translation Technology for Language Modeling
Virtually all natural language systems that produce text -- from speech
recognition to natural language generation -- rely on a language model
as a core component, both to distinguish well-formed outputs from
ill-formed ones and to rank word strings by their appropriateness for a
given context. The low quality of these systems' outputs is most often
attributed to the language model's failure to capture what
"well-formedness" means. The difficulty of modeling
well-formedness is due not only to the mathematical and
algorithmic challenges specific to the integration of multiple sources
of knowledge, but also to the lack of robust, scalable tools and
models for semantic, syntactic, and morphological processing.
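To make the language model's ranking role concrete, here is a minimal sketch (not part of the project's own software) of how a smoothed bigram model scores competing word strings; the toy corpus, candidate sentences, and function names are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count context unigrams and bigrams, with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # contexts only
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) /
                       (unigrams[prev] + vocab_size))
    return lp

# Train on a toy corpus, then rank a well-formed candidate against
# a scrambled one, as a recognizer or generator would.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
uni, bi = train_bigram_counts(corpus)
V = len(set(w for s in corpus for w in s.split())) + 2  # + <s>, </s>
candidates = ["the cat sat on the rug", "rug the on sat cat the"]
ranked = sorted(candidates, key=lambda s: -log_prob(s, uni, bi, V))
print(ranked[0])  # the well-formed candidate scores higher
```

In practice the project's models incorporate far richer structure (morphology, syntax, cross-language features) than this word-bigram baseline, but the ranking mechanism is the same.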
The overall goal of this project is to develop novel statistical and
linguistic techniques that will exploit the information that is
available in parallel multilingual corpora (i.e. translations of the
same source in multiple languages). Such corpora implicitly encode a
hidden, common core that can be uncovered using state-of-the-art
parameter estimation techniques. The research plan involves two main
thrusts: i) automatic learning of structure in and across languages at
multiple levels of abstraction: semantics, morphology, phonology, and
paraphrasing, and ii) integration of the results into novel language
model frameworks and estimation procedures to address the problem of
limited domain- and language-specific training data. The hypothesis is
that, by sharing data and structure across languages and genres within
a language, the resulting models will be richer and more robust. Such
approaches were previously infeasible with only a single language or
language pair, but are now practical given the availability of large
multilingual corpora and significant increases in computing power.
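One simple way to share data across domains or languages, sketched below purely for illustration (the project itself is not limited to this technique), is linear interpolation: a sparse in-domain model is mixed with a larger out-of-domain one so that events unseen in the small sample still receive reasonable probability. The corpora, weights, and function names here are assumptions.

```python
import math
from collections import Counter

def unigram_prob(word, counts, vocab_size):
    """Add-one smoothed unigram probability under one corpus."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

def mixture_prob(word, in_dom, out_dom, lam, vocab_size):
    """Linear interpolation: lam weights the sparse in-domain model,
    (1 - lam) the larger out-of-domain (or other-language) model."""
    return (lam * unigram_prob(word, in_dom, vocab_size) +
            (1 - lam) * unigram_prob(word, out_dom, vocab_size))

# Toy illustration: "broadcast" is unseen in the small in-domain
# sample but well attested in a larger general corpus.
in_dom = Counter("the weather report said rain".split())
out_dom = Counter(("the broadcast said the weather broadcast "
                   "said rain again").split())
V = len(set(in_dom) | set(out_dom))
p_in = unigram_prob("broadcast", in_dom, V)
p_mix = mixture_prob("broadcast", in_dom, out_dom, 0.5, V)
assert p_mix > p_in  # interpolation rescues the unseen word
```

In the project, the interpolation weight would be estimated on held-out data (e.g. by EM) rather than fixed by hand, and the component models would be structured rather than unigram.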
SPONSOR: NSF (IIS-0326276)
AWARD PERIOD: September 2003 - August 2007
UW TEAM MEMBERS:
ISI TEAM MEMBERS:
PUBLICATIONS:
- K. Duh and K. Kirchhoff, "Automatic Learning of Language Model
Structure," COLING 2004, Geneva, Switzerland.
- D. Vergyri, K. Kirchhoff, K. Duh, and A. Stolcke, "Morphology-Based
Language Modeling for Arabic Speech Recognition," ICSLP 2004, Korea.
- K. Duh and K. Kirchhoff, "Automatic Learning of Language Model
Structure," UWEE Tech Report 2004-0014.