Dependence Tree Models of Intra-Utterance Phone Dependence

Orith Ronen

This thesis addresses the problem of modeling statistical dependence among a large set of random variables or vectors for use in pattern recognition applications. For a set of n random variables, the full joint probability function is defined over an n-dimensional space, or an nd-dimensional space for d-dimensional random vectors. For large n, it is useful to approximate the distribution in a manner that reduces dimensionality while still capturing correlations. To achieve this goal, we approximate the joint distribution using a class of hierarchical models called dependence trees. Dependence trees make a Markov assumption along the branches of a tree, which allows them to model a set of random variables with no temporal structure.
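As an illustration of the Markov assumption on tree branches, the standard first-order dependence tree factorization can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the thesis: x_r denotes the variable at the root of the tree, and \pi(i) denotes the parent of variable i in the tree.

```latex
% Dependence tree approximation of an n-variable joint distribution.
% Assumed notation: r is the root index, \pi(i) is the parent of node i.
p(x_1, \dots, x_n) \;\approx\; p(x_r) \prod_{i \neq r} p\bigl(x_i \mid x_{\pi(i)}\bigr)
```

Each factor involves at most two variables, so the model is specified by pairwise distributions rather than by a single function over the full n-dimensional space.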

As the primary application of this general approach, we explore long-term dependencies among sub-word units within an utterance, where the variables are units such as phones in English. The motivation for developing this model comes from speech recognition, based on the intuition that phones within an utterance are correlated because the utterance comes from one speaker. This effect is not included in current models that assume speech segments are independent, and it provides important information on how sounds are related to other sounds.

Although discrete dependence tree design algorithms exist, some modifications were needed to apply the technique to speech. We present extensions of prior work, and introduce a new model for continuous observation sequences using hidden dependence trees. Practical limitations of the original algorithm are addressed by robust topology design techniques. The contributions of the thesis also include the development of an efficient algorithm for training discrete and hidden dependence tree models with incomplete data, and the development of a two-level tree growing algorithm that enables the design of large dependence trees.
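For readers unfamiliar with discrete dependence tree design, the sketch below shows the classical Chow-Liu procedure, which is one well-known algorithm of this kind: estimate the pairwise mutual information from data, then keep the maximum-weight spanning tree. Whether the thesis builds on exactly this procedure is an assumption, and the function names `mutual_information` and `chow_liu_edges` are hypothetical. The sketch does not cover the hidden dependence trees, incomplete-data training, or two-level tree growing described above.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(samples, i, j):
    """Empirical mutual information (in nats) between variables i and j."""
    n = len(samples)
    count_i = Counter(s[i] for s in samples)
    count_j = Counter(s[j] for s in samples)
    count_ij = Counter((s[i], s[j]) for s in samples)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_i[a] * count_j[b]))
    return mi


def chow_liu_edges(samples):
    """Edges of a maximum-weight spanning tree over pairwise mutual information.

    Uses Kruskal's algorithm with a small union-find structure.
    """
    num_vars = len(samples[0])
    weighted = sorted(
        ((mutual_information(samples, i, j), i, j)
         for i, j in combinations(range(num_vars), 2)),
        reverse=True,
    )
    parent = list(range(num_vars))

    def find(x):
        # Follow parent pointers to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for _, i, j in weighted:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding this edge does not create a cycle
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


# Toy data: variables 0 and 1 always agree, variable 2 is independent of both.
toy_samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
toy_edges = chow_liu_edges(toy_samples)
```

On the toy data, the strongly dependent pair (0, 1) is always selected as a tree edge, while the independent variable 2 is attached by a zero-weight edge whose endpoint depends on tie-breaking.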

We apply the model to word recognition by combining its likelihood score with other acoustic and language model scores, showing a small reduction in recognition error rate. We also explore the context-dependent modeling problem with phonetic units conditioned on local context, which is an important step for future use of the model in speech recognition. We describe the mathematical framework for other speech processing applications of the model, and show how it applies to problems in medical diagnosis, as an example of the broad range of problems for which this work has implications.

The full thesis is available in PostScript format (2.21 MB).

