Segment Modeling Alternatives for Continuous Speech Recognition

Owen Kimball

This dissertation presents alternative parametric statistical models of phonetically based segments for use in continuous speech recognition (CSR). A categorization of segment modeling approaches is proposed according to two characteristics: the assumed form of the probability distribution and the representation chosen for segment observations. The question of distribution form divides models into two groups: those based on conditional probability densities of the features given the label, and those using a posteriori probabilities of the label given the features. The second characteristic concerns whether a model uses a variable-length or fixed-length representation of observed speech segments. The choices for both characteristics have important implications, particularly for context modeling and score normalization. In this work, specific segment models are developed in order to understand the benefits and limitations that follow from these choices.
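
The two distribution forms named above are related by Bayes' rule, which can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the thesis: $x$ denotes the observed features of a segment, $a$ a phone label, and $\mathcal{A}$ the label inventory.

```latex
% Conditional-density form: score a segment with p(x | a).
% Posterior form: score a segment with P(a | x).
% Bayes' rule connects the two; the denominator sums over all labels.
\[
  P(a \mid x) \;=\; \frac{p(x \mid a)\, P(a)}{\sum_{a' \in \mathcal{A}} p(x \mid a')\, P(a')}
\]
```

The denominator depends on the observed segment but not on the candidate label, which is one way to see why the choice between the two forms bears on score normalization.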

Mixture distributions are a particular type of conditional density with appealing modeling properties. For the special case of segment models that use variable-length representations and conditional densities, various forms of Gaussian mixture models are examined for the individual samples of the feature sequence. Within this framework, existing and novel mixture modeling techniques are compared systematically. Parameter-tying alternatives for frame-level mixtures are explored, and good performance is demonstrated with this approach.

Within the conditional-density variable-length framework, a generalization of mixture distributions that captures properties of the complete segment is proposed in the form of a segment-level mixture model. This approach models intra-segment correlation indirectly using a mixture of segment-length models, each of which uses conditionally independent time samples. Parameter estimation formulae are derived and the model is explored experimentally.
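
The difference between a frame-level and a segment-level mixture can be illustrated with a minimal Python sketch. It is not the model developed in the thesis: it assumes diagonal-covariance Gaussians, one mean and variance per mixture component shared by every frame of the segment, and hypothetical function names. In the frame-level case each frame chooses its own component; in the segment-level case one component is chosen for the whole segment and the frames are conditionally independent given that choice.

```python
import numpy as np


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian over the last axis."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def component_logpdfs(segment, means, variances):
    """Return a (T, K) array: log density of frame t under component k."""
    return diag_gauss_logpdf(segment[:, None, :], means[None], variances[None])


def frame_level_loglik(segment, weights, means, variances):
    """sum_t log sum_k w_k N(x_t; mu_k, var_k): a component is chosen per frame."""
    comp = component_logpdfs(segment, means, variances)
    return float(np.sum(logsumexp(np.log(weights) + comp, axis=1)))


def segment_level_loglik(segment, weights, means, variances):
    """log sum_k w_k prod_t N(x_t; mu_k, var_k): one component per segment."""
    comp = component_logpdfs(segment, means, variances)
    # Summing log densities over time is the conditional-independence assumption.
    return float(logsumexp(np.log(weights) + comp.sum(axis=0), axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segment = rng.normal(size=(6, 2))          # T=6 frames, D=2 features (toy data)
    weights = np.array([0.4, 0.6])             # K=2 mixture weights
    means = np.array([[0.0, 0.0], [1.0, -1.0]])
    variances = np.ones((2, 2))
    print(frame_level_loglik(segment, weights, means, variances))
    print(segment_level_loglik(segment, weights, means, variances))
```

Because the component is shared by all frames in the segment-level case, the frames are dependent once the component is marginalized out, which is how such a mixture can capture intra-segment correlation indirectly. With a single component, or a single frame, the two scores coincide.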

The alternative assumption of modeling based on a posteriori probabilities is examined through the development of a recognition formalism using classification and segmentation scoring. Posterior distributions have been less well studied than conditional densities in the context of CSR, and this work introduces a theoretically consistent, segment-level posterior distribution model using context-dependent models. Issues concerning fixed-length versus variable-length representations and segmentation scoring are explored experimentally. Finally, some general conclusions are drawn concerning the practical and theoretical trade-offs for the models examined.

