Segment-based Stochastic Models of Spectral Dynamics for
Continuous Speech Recognition
Vassilios V. Digalakis
This dissertation addresses the problem of modeling the joint time-spectral
structure of speech for recognition. Four areas are covered in this work:
segment modeling, estimation, recognition search algorithms, and extension
to a more general class of models. A unified view of the acoustic models
that are currently used in speech recognition is presented; the research is then
focused on segment-based models that provide a better framework for modeling
the intrasegmental statistical dependencies than the conventional hidden Markov
models (HMMs). The validity of a linearity assumption for modeling the
intrasegmental statistical dependencies is first checked, and it is shown that
the basic assumption of conditionally independent observations given the
underlying state sequence that is inherent to HMMs is inaccurate. Based on
these results, linear models are chosen for the distribution of the
observations within a segment of speech. Motivated by the original work
on the stochastic segment model, a dynamical system segment model is
proposed. Training this model is equivalent to the maximum likelihood
identification of a stochastic linear system, and a simple alternative to
the traditional approach is developed.
This procedure is based on the Expectation-Maximization algorithm and is
analogous to the Baum-Welch algorithm for HMMs, since the dynamical system
segment model can be thought of as a continuous state HMM. Recognition
involves computing the probability of the innovations given by Kalman
filtering. The large computational complexity of segment-based models is
dealt with by the introduction of fast recognition search algorithms as
alternatives to the typical Dynamic Programming search. A Split-and-Merge
segmentation algorithm is developed that achieves a significant computation
reduction with no loss in recognition performance. Finally, the models are
extended to the family of embedded segment models that are better suited for
capturing the hierarchical structure of speech and modeling intersegmental
statistical dependencies. Experimental results are based on
speaker-independent phoneme recognition using the TIMIT database, and represent
the best context-independent phoneme recognition performance reported on this
task. In addition, the proposed dynamical system segment model is the first
that removes the output independence assumption.
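
As an illustration of the recognition step described above (the probability of the innovations given by Kalman filtering), the sketch below scores one segment of observation vectors under a generic linear Gaussian state-space model. It is a minimal example under stated assumptions, not the thesis's implementation: the model form x[k+1] = F x[k] + w[k], y[k] = H x[k] + v[k], the function name, and the parameter names (F, H, Q, R, mu0, P0) are all chosen here for illustration.

```python
import numpy as np

def innovations_log_likelihood(Y, F, H, Q, R, mu0, P0):
    """Log-likelihood of a segment Y (T x d) under an assumed linear
    Gaussian state-space model, accumulated from Kalman filter innovations.

    Assumed (hypothetical) model, not taken from the thesis:
        x[k+1] = F x[k] + w[k],  w ~ N(0, Q)
        y[k]   = H x[k] + v[k],  v ~ N(0, R)
        x[0] ~ N(mu0, P0)
    """
    x_pred = np.asarray(mu0, dtype=float)   # predicted state mean
    P_pred = np.asarray(P0, dtype=float)    # predicted state covariance
    d = H.shape[0]                          # observation dimension
    log_lik = 0.0
    for y in np.atleast_2d(Y):
        # Innovation and its covariance.
        e = y - H @ x_pred
        S = H @ P_pred @ H.T + R
        _, logdet = np.linalg.slogdet(S)
        log_lik += -0.5 * (d * np.log(2.0 * np.pi) + logdet
                           + e @ np.linalg.solve(S, e))
        # Measurement update; K = P H^T S^{-1}, using symmetry of S and P.
        K = np.linalg.solve(S, H @ P_pred).T
        x_filt = x_pred + K @ e
        P_filt = P_pred - K @ S @ K.T
        # Time update for the next frame.
        x_pred = F @ x_filt
        P_pred = F @ P_filt @ F.T + Q
    return log_lik
```

In a classifier built on this idea, one such score would be computed per candidate phone model and the highest-scoring model chosen for the segment; how the thesis parameterizes the models across segment durations is not covered by this sketch.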