Specifically, this research creates decision tree models that predict prosodic markers, and a dynamical system model that generates F0 (fundamental frequency) and energy contours. Classification trees, combined with a Markov sequence assumption, predict pitch accents and phrase tone types. Regression trees estimate F0 range and prominence levels. These trees use linguistically motivated features derived from text, such as lexical stress and part-of-speech. The model for F0 and energy generation is a new approach that incorporates traditional methods of F0 generation into a model whose parameters are estimated automatically from labeled speech. F0 and energy are generated with a state-space dynamical system model, which assumes that an unobserved state vector underlies the noisy observations of F0 and energy. Parameters are specified to capture segment-, syllable-, and phrase-level effects. Because the state is unobserved, parameters are estimated using a non-traditional method based on the EM algorithm. The two models are evaluated, independently and together, in quantitative and perceptual tests that demonstrate improvements in the quality of text-to-speech synthesis. The models are also shown to be useful in prosody recognition applications.
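As a rough illustration of the state-space idea described above, the following sketch simulates a scalar linear dynamical system in which a hidden state evolves toward input-driven targets and is observed through additive noise. It is not the thesis's model: the function name `simulate_contour`, the scalar state, the coefficients `a`, `b`, `c`, and the Gaussian noise terms are all assumptions chosen for illustration, and the EM-based parameter estimation is not shown.

```python
import random


def simulate_contour(inputs, a=0.9, b=0.1, c=1.0,
                     state_noise_sd=0.0, obs_noise_sd=0.0,
                     x0=0.0, seed=0):
    """Simulate a scalar state-space model (illustrative assumption only).

    State update:  x[t+1] = a * x[t] + b * u[t] + w[t]
    Observation:   y[t]   = c * x[t] + v[t]

    w and v are zero-mean Gaussian noise with the given standard deviations.
    Returns the hidden states and the noisy observations as two lists.
    """
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for u in inputs:
        states.append(x)
        # The observation is a noisy view of the current hidden state.
        observations.append(c * x + rng.gauss(0.0, obs_noise_sd))
        # The hidden state moves toward a level set by the input u.
        x = a * x + b * u + rng.gauss(0.0, state_noise_sd)
    return states, observations


if __name__ == "__main__":
    # Hypothetical input: a constant target level of 120 held for 50 frames.
    states, observations = simulate_contour(
        [120.0] * 50, x0=100.0, state_noise_sd=0.5, obs_noise_sd=2.0)
    for t in range(0, 50, 10):
        print(t, round(states[t], 2), round(observations[t], 2))
```

In this sketch the input sequence stands in for whatever segment-, syllable-, or phrase-level information drives the contour. With both noise terms set to zero and `|a| < 1`, a constant input `u` makes the state settle at `b * u / (1 - a)`.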
The full thesis in PostScript format (1.37 MB).