Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition

Vassilios V. Digalakis

This dissertation addresses the problem of modeling the joint time-spectral structure of speech for recognition. Four areas are covered in this work: segment modeling, estimation, recognition search algorithms, and extension to a more general class of models. A unified view of the acoustic models that are currently used in speech recognition is presented; the research is then focused on segment-based models that provide a better framework for modeling the intrasegmental statistical dependencies than the conventional hidden Markov models (HMMs). The validity of a linearity assumption for modeling the intrasegmental statistical dependencies is first checked, and it is shown that the basic assumption of conditionally independent observations given the underlying state sequence that is inherent to HMMs is inaccurate. Based on these results, linear models are chosen for the distribution of the observations within a segment of speech. Motivated by the original work on the stochastic segment model, it is shown that maximum likelihood estimation of a dynamical system segment model is equivalent to the maximum likelihood identification of a stochastic linear system, and a simple alternative to the traditional identification approach is developed. This procedure is based on the Expectation-Maximization algorithm and is analogous to the Baum-Welch algorithm for HMMs, since the dynamical system segment model can be thought of as a continuous-state HMM. Recognition involves computing the probability of the innovations given by Kalman filtering. The large computational complexity of segment-based models is dealt with by the introduction of fast recognition search algorithms as alternatives to the typical Dynamic Programming search. A Split-and-Merge segmentation algorithm is developed that achieves a significant reduction in computation with no loss in recognition performance.
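The innovations-based scoring mentioned above can be sketched as a standard Kalman-filter likelihood computation: the filter's one-step prediction errors (innovations) are Gaussian, so their log-densities accumulate into the segment log-likelihood. This is a minimal generic sketch, not the thesis' exact parameterization; the matrices F, H, Q, R and the function name are illustrative assumptions.

```python
import numpy as np

def segment_log_likelihood(y, F, H, Q, R, x0, P0):
    """Log-likelihood of an observation sequence y (T x d) under a
    linear dynamical system, accumulated from Kalman-filter innovations.

    Assumed model (illustrative, not the thesis' exact formulation):
        x[t+1] = F x[t] + w,   w ~ N(0, Q)   (hidden trajectory state)
        y[t]   = H x[t] + v,   v ~ N(0, R)   (observed spectral frame)
    """
    x, P = x0, P0
    d = y.shape[1]
    ll = 0.0
    for t in range(y.shape[0]):
        # Innovation (one-step prediction error) and its covariance.
        e = y[t] - H @ x
        S = H @ P @ H.T + R
        # Gaussian log-density of the innovation.
        ll += -0.5 * (d * np.log(2 * np.pi)
                      + np.linalg.slogdet(S)[1]
                      + e @ np.linalg.solve(S, e))
        # Measurement update.
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ e
        P = P - K @ H @ P
        # Time update.
        x = F @ x
        P = F @ P @ F.T + Q
    return ll
```

In a segment-based recognizer, one such model would be scored per phone hypothesis and the segment log-likelihoods combined by the search.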
Finally, the models are extended to the family of embedded segment models, which are better suited for capturing the hierarchical structure of speech and modeling intersegmental statistical dependencies. Experimental results are based on speaker-independent phoneme recognition using the TIMIT database, and represent the best context-independent phoneme recognition performance reported on this task. In addition, the proposed dynamical system segment model is the first that removes the output independence assumption.

The full thesis in PostScript format (854 kB).
