Statistical Methods for Analysis and Recognition of Intonation Patterns in Speech

John W. Butzberger, Jr.

An important aspect of speech that current speech recognition and understanding systems do not typically employ is prosody. Prosody consists of intensity, duration, and intonation information, which can provide structural cues and semantic knowledge. Prosody has the potential to contribute to both speech synthesis and speech understanding systems. This thesis describes a first step toward the analysis of prosody for speech understanding, specifically the computational modelling of intonation using statistical methods. The model has potential applications as an additional knowledge source for recognizing and parsing spoken sentences.

The overall objective of this work is to understand how statistical modelling can be applied to intonation patterns produced in isolated words and in continuous speech. Our approach involves four major experiments: (1) isolated word intonation recognition, (2) boundary tone clustering, (3) boundary tone classification, and (4) spotting of boundary tones in continuous speech.

We employ discrete hidden Markov models (HMMs) to characterize intonation patterns, because HMMs have been successful in modelling the random spectral and temporal structure of speech for word recognition. Since we use discrete distribution HMMs, vector quantization of the features is necessary to generate discrete observations, and different methods of vector quantization are explored.
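As a rough illustration of how vector quantization feeds a discrete HMM, the sketch below maps feature vectors to codeword indices by minimum squared Euclidean distortion and then scores the resulting symbol sequence with the scaled forward algorithm. The function names, the codebook, and the model parameters are hypothetical and chosen only for illustration; they are not the features, codebooks, or models used in the thesis.

```python
import math

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codeword
    (minimum squared Euclidean distortion). Codebook is assumed given."""
    def distortion(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: distortion(f, codebook[k]))
            for f in frames]

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM.

    obs   : non-empty list of symbol indices
    start : start[i] = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k] = P(symbol k | state i)
    Returns log P(obs | model).
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_p += math.log(scale)  # accumulate log of the scaling factors
        alpha = [a / scale for a in alpha]
    return log_p

# Illustrative use: a made-up 1-D codebook (e.g. falling, level, rising slope).
codebook = [(-1.0,), (0.0,), (1.0,)]
symbols = quantize([(0.9,), (1.2,), (-0.8,)], codebook)
```

In a recognizer of this kind, one such model would typically be trained per intonation pattern, and an utterance would be assigned to the pattern whose model gives the highest log-likelihood.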

For isolated word intonation recognition, we search for the best combination of feature processing, vector quantization, and hidden Markov modelling techniques for recognition of statement, question, command, calling, and continuation patterns. A best-case accuracy of 89% was achieved using minimum distortion VQ and 3-state HMMs.

For boundary tone clustering, HMMs are used to characterize each cluster. Distinctions finer than "rise" and "fall" are obtained using a divisive clustering procedure. Typically, these distinctions were associated with prominence. A boundary tone classification experiment correctly identified discretely extracted boundary tones as rise or fall with 86% accuracy. Finally, boundary tones were spotted in continuous speech at an average detection rate of 33% with a false alarm rate of 1.0 per known boundary tone.
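To make the two spotting figures concrete, the sketch below computes a detection rate (hits divided by known boundary tones) and a false alarm rate expressed per known boundary tone. The function name, the time-based matching rule, and the tolerance value are assumptions made for illustration; the abstract does not state how detections were matched to reference tones, and the numbers in the example are invented.

```python
def spotting_scores(detected, known, tolerance=0.05):
    """Return (detection_rate, false_alarms_per_known_tone).

    detected, known : lists of times in seconds
    tolerance       : hypothetical matching window in seconds
    Each known tone can be matched by at most one detection.
    """
    if not known:
        raise ValueError("need at least one known boundary tone")
    unmatched = sorted(known)
    hits = 0
    false_alarms = 0
    for t in sorted(detected):
        # Find the first still-unmatched known tone within the window.
        match = next((k for k in unmatched if abs(k - t) <= tolerance), None)
        if match is None:
            false_alarms += 1
        else:
            hits += 1
            unmatched.remove(match)
    return hits / len(known), false_alarms / len(known)

# Invented example: three known tones, four detections, one of them correct.
rates = spotting_scores([1.02, 1.5, 2.6, 3.3], [1.0, 2.0, 3.0])
```

Under this definition, a false alarm rate of 1.0 means that, on average, the spotter produces one spurious detection for every boundary tone actually present.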
