High-Order Modeling Techniques for Continuous Speech Recognition

The goal of this work is to develop and explore novel stochastic modeling techniques for acoustic and language modeling in large vocabulary continuous speech recognition, particularly recognition of spontaneous speech. Although significant advances have been made in recognition technology in recent years, spontaneous speech recognition accuracy is still hardly better than 50%. More casual speaking modes introduce additional sources of variability that require improvements at all levels of the recognition process, both in terms of the baseline stochastic models and the techniques for adapting these models. In addressing these challenges, the general theme of the research in this project is high-level correlation modeling, i.e. extending the representation of correlation among observations beyond the level of the frame or the word to dependencies within and across utterances associated with speaker, channel, topic and/or speaking style. Continuing the ARPA-ONR funded work at Boston University (BU) on segment-based acoustic modeling for speech recognition, the current project builds on the stochastic segment model, algorithms developed for distribution clustering in acoustic modeling and sentence-level mixture language modeling, and the BU recognition system in general. The recognition framework also includes a multi-pass search strategy to accommodate the higher-order (and therefore more computationally expensive) models explored here. In particular, we concentrate on three problems: development of hierarchical models of intra-utterance correlation of phones and model states, e.g. by extending the theory of Markov dependence trees; unsupervised adaptation of acoustic models within and across utterances based on these models; and sub-language modeling triggered by acoustic and dialog-level cues. In all cases, the approach involves developing formal models of statistical dependence that overcome limitations of existing models, in combination with exploring fast search and robust parameter estimation techniques to address the added complexity of these models. Although we consider radically new models, we also build on the existing strengths of speech recognition technology, both in the theoretical foundation and in the use of multi-pass search, with the intention that advances can be easily used in existing systems. (January 1995 -- December 1997)
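
For reference, the two model families named above admit standard factorizations; the equations below are a background sketch rather than a specification of the project's models, with trigram components assumed in the mixture for concreteness. A dependence tree over observations x_1, ..., x_N with root r and parent map pa(.) factors the joint distribution as

    P(x_1, \ldots, x_N) = P(x_r) \prod_{i \neq r} P\bigl(x_i \mid x_{pa(i)}\bigr),

while a sentence-level mixture language model with K components and mixture weights \lambda_k assigns a sentence w_1, ..., w_n the probability

    P(w_1, \ldots, w_n) = \sum_{k=1}^{K} \lambda_k \prod_{i=1}^{n} P_k\bigl(w_i \mid w_{i-1}, w_{i-2}\bigr).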

SPONSOR: DoD, Office of Naval Research ONR-N00014-92-J-1778

PUBLICATIONS:

The publications below were supported in whole or in part by this grant. Publications supported by a previous related ONR-ARPA grant are listed on the publications page.

"Parameter Estimation of Dependence Tree Models Using the EM Algorithm," O. Ronen, J. R. Rohlicek and M. Ostendorf, manuscript submitted to IEEE Signal Processing Letters, Vol. 2, No. 8, August 1995, pp. 157-159.

"From HMMs to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition," M. Ostendorf, V. Digalakis and O. Kimball, manuscript submitted to IEEE Trans. on Speech and Audio Processing.

"Lattice-based Search Strategies for Large Vocabulary Speech Recognition," F. Richardson, Boston University M.S. Thesis, 1994.

"Auditory-based signal processing for speech recognition," S. Zlotkin, Boston University B.S. Project, 1995.

"The 1994 BU NAB News Benchmark System," M. Ostendorf, F. Richardson, R. Iyer, A. Kannan, O. Ronen and R. Bates, Proceedings of the ARPA Workshop on Spoken Language Technology, 1995, pp. 139-142.

"Lattice-based Search Strategies for Large Vocabulary Recognition," F. Richardson, M. Ostendorf and J. R. Rohlicek, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 576-579, May 1995.
