Modeling of Intonation for Speech Synthesis

Ken Ross

Higher-quality speech synthesis is needed to make text-to-speech technology useful in more applications, and prosody -- the suprasegmental aspects of speech that supply information about sentence meaning -- is one of the aspects of synthesis technology most in need of improvement. The goal of this research is to develop automatically trainable computational models of prosody that can be incorporated into existing text-to-speech synthesizers. The overall model consists of two modules: the first predicts abstract prosodic markers from text, and the second generates fundamental frequency (F_0) and energy contours from the abstract markers and the text. The research draws on recent developments in linguistic theory to provide the structure for the models, and on recent advances in statistical modeling to provide a formalism for automatically estimating the model parameters. Because statistical models are trained automatically, they have advantages over rule-based models; in particular, they can be adapted to a different speaking style by retraining on a different corpus.

Specifically, this research creates decision tree models that predict the prosodic markers, and a dynamical system model that generates F_0 and energy contours. Classification trees, in conjunction with a Markov sequence assumption, predict pitch accents and phrase tone types. Additionally, regression trees estimate F_0 range and prominence levels. These trees use linguistically motivated features derived from the text, such as lexical stress and part of speech. The model for F_0 and energy generation is a unique approach that incorporates traditional methods of F_0 generation into a model whose parameters are estimated automatically from labeled speech. F_0 and energy are generated with a state-space dynamical system model, which assumes that an unobserved state vector underlies the noisy observations of F_0 and energy. Parameters are specified to capture segment-, syllable-, and phrase-level effects. Because the state is unobserved, the parameters are estimated using a non-traditional method based on the EM algorithm. The two models are evaluated, independently and together, in quantitative and perceptual tests that demonstrate improvements in the quality of text-to-speech synthesis. The models are also shown to be useful in prosody recognition applications.
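
To illustrate the general idea of a state-space model with an unobserved state and noisy observations, the following is a minimal sketch and not the thesis's actual model. It assumes a scalar hidden state, a linear update, and Gaussian noise; the function name `generate_contour`, the input sequence (a hypothetical per-frame target level, for example higher on accented syllables), and all parameter values are invented for illustration. In the thesis, the parameters depend on segment, syllable, and phrase context and are estimated with an EM-based method; here they are simply fixed by hand.

```python
import random


def generate_contour(inputs, a=0.8, b=0.2, c_f0=1.0, c_energy=0.5,
                     state_noise=0.0, obs_noise=0.0, x0=0.0, seed=0):
    """Simulate a scalar linear state-space model (illustrative assumption).

    Hidden state:   x[t] = a * x[t-1] + b * u[t] + w[t]
    Observations:   f0[t]     = c_f0     * x[t] + v1[t]
                    energy[t] = c_energy * x[t] + v2[t]
    where w, v1, v2 are zero-mean Gaussian noise terms.
    All parameter values are hypothetical, not taken from the thesis.
    """
    rng = random.Random(seed)  # local generator so results are repeatable
    x = x0
    observations = []
    for u in inputs:
        # Advance the unobserved state toward the target implied by u.
        x = a * x + b * u + rng.gauss(0.0, state_noise)
        # Emit noisy observations of the state.
        f0 = c_f0 * x + rng.gauss(0.0, obs_noise)
        energy = c_energy * x + rng.gauss(0.0, obs_noise)
        observations.append((f0, energy))
    return observations


if __name__ == "__main__":
    # Hypothetical input: a raised target for three frames, then back to zero.
    targets = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    for f0, energy in generate_contour(targets, obs_noise=0.05):
        print(f"{f0:7.3f} {energy:7.3f}")
```

With the noise terms set to zero, the output is a smooth rise toward the target followed by a decay. With nonzero noise, only the noisy outputs are visible, which is why an estimation method that handles unobserved data, such as EM, is needed to fit the parameters.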

The full thesis is available in PostScript format (1.37 MB).
