Computational Models of the Prosody/Syntax Mapping for Spoken Language Systems

Nanette Marie Veilleux

Prosodic information, encoded in speech as the grouping of words (phrasing) and the relative prominence of some syllables in an utterance, is important in human understanding of speech. In order to use prosodic information in automatic spoken language systems, computational models of the relationship between prosody and syntactic structure (which is in turn related to meaning) are needed.

This thesis develops two different models of the prosody/syntax mapping (a hierarchical model and a decision tree model of prosodic phrasing) and a joint model of the mapping from the acoustic signal to syntax. The joint model of the acoustic/syntax mapping is accomplished by combining a prosody/syntax model, which represents the probabilistic relationship between syntax and abstract units of prosody, with another model that represents the mapping between these abstract units and acoustic features. In this way, prosodic structure serves as an intermediate representation between the acoustic and syntactic domains. The joint acoustic/prosody/syntax model is used in speech understanding to compute a prosody-parse score, which expresses the degree of the match between acoustic features and a proposed syntactic representation.
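The combination described above, with prosodic structure as an intermediate representation, can be illustrated with a minimal sketch. It is not the thesis's actual formulation: the function name, the two prosodic labels, the toy probabilities, and especially the assumption that word boundaries are scored independently are all hypothetical simplifications made for illustration.

```python
import math

def prosody_parse_score(acoustic_given_label, label_given_syntax):
    """Log-score of one candidate parse against the acoustics.

    acoustic_given_label[i][b]: p(acoustic features at boundary i | prosodic label b)
    label_given_syntax[i][b]:   p(prosodic label b | syntax at boundary i under this parse)

    Simplifying assumption (not from the thesis): boundaries are independent,
    so the abstract prosodic label can be summed out one boundary at a time.
    """
    score = 0.0
    for p_acoustic, p_label in zip(acoustic_given_label, label_given_syntax):
        # Marginalize over the intermediate prosodic label.
        marginal = sum(p_acoustic[label] * p_label[label] for label in p_label)
        score += math.log(marginal)
    return score

# Toy numbers, invented for illustration only.
acoustic = [
    {"break": 0.8, "no_break": 0.2},
    {"break": 0.1, "no_break": 0.9},
]
parse_a = [
    {"break": 0.7, "no_break": 0.3},
    {"break": 0.2, "no_break": 0.8},
]
parse_b = [
    {"break": 0.1, "no_break": 0.9},
    {"break": 0.6, "no_break": 0.4},
]

score_a = prosody_parse_score(acoustic, parse_a)
score_b = prosody_parse_score(acoustic, parse_b)
best = "parse_a" if score_a > score_b else "parse_b"
```

In this toy setting the parse whose predicted phrasing agrees better with the acoustic evidence receives the higher score, which is the sense in which a prosody-parse score expresses the degree of match between the acoustics and a proposed syntactic representation.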

One major contribution of this work is that the computational models are formulated in a probabilistic framework that uses decision trees in a non-traditional way, namely to estimate probability distributions. The models themselves represent a significant contribution, in part because the same models can be used in both synthesis and understanding applications. The usefulness of these models is demonstrated in three applications. First, the decision tree and the hierarchical model are used to predict the correct placement of prosodic phrase boundaries, exploiting the relationship between prosody and syntax to improve synthetic speech quality. Second, the probabilistic prosody-parse scoring system is used to automatically select between two possible interpretations of an utterance, achieving performance close to that of human listeners. Finally, the prosody-parse scoring system is used in an existing automatic speech understanding system to improve word recognition performance. Although their utility is demonstrated in specific implementations, the models presented here are general, and the contributions of this work can extend beyond the specific applications presented.
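One way to read "decision trees used to estimate probability distributions" is that each leaf stores the relative frequencies of the labels that reach it, instead of a single predicted class. The sketch below shows that idea only; the tree is fixed by hand rather than grown from data, and the feature name `clause_boundary`, the labels, and the training examples are hypothetical and not taken from the thesis.

```python
from collections import Counter

class Leaf:
    """Terminal node that accumulates label counts."""
    def __init__(self):
        self.counts = Counter()

    def find_leaf(self, features):
        return self

    def distribution(self):
        # Relative frequencies of the labels seen at this leaf.
        total = sum(self.counts.values())
        return {label: n / total for label, n in self.counts.items()}

class Node:
    """Internal node that routes on a yes/no question about the features."""
    def __init__(self, question, yes, no):
        self.question = question
        self.yes = yes
        self.no = no

    def find_leaf(self, features):
        branch = self.yes if self.question(features) else self.no
        return branch.find_leaf(features)

# Hand-built one-question tree (an assumption; real trees are grown from data).
tree = Node(lambda f: f["clause_boundary"], yes=Leaf(), no=Leaf())

# Invented training examples: (features, prosodic label).
data = (
    [({"clause_boundary": True}, "break")] * 3
    + [({"clause_boundary": True}, "no_break")] * 1
    + [({"clause_boundary": False}, "no_break")] * 4
    + [({"clause_boundary": False}, "break")] * 1
)
for features, label in data:
    tree.find_leaf(features).counts[label] += 1

# Query the tree for a distribution rather than a single class.
dist = tree.find_leaf({"clause_boundary": True}).distribution()
```

Returning a distribution rather than a hard decision is what lets the same tree serve two purposes: the most probable label can drive phrase-boundary prediction for synthesis, while the full probabilities can feed a scoring function for understanding.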

The full thesis is available in PostScript format (998 kB).
