) and energy
contours from the abstract markers and text. This research draws on
recent developments in linguistic theory to provide the structure for
the models, and on recent advances in statistical modeling to provide
a formalism for automatically generating the model parameters.
Because statistical models are automatically trained, they have
advantages over rule-based models, particularly that they can be
easily modified to different speaking styles via retraining on a
different corpus.
Specifically, this research creates decision tree models that predict
the prosodic markers, and a dynamical system model that generates F0
and energy contours. Classification trees in conjunction with a
Markov sequence assumption predict pitch accents and phrase tone
types. Additionally, regression trees estimate F0 range and
prominence levels. These trees use linguistically motivated features
derived from the text, such as lexical stress and part of speech.
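As an illustration of how tree outputs and a Markov sequence assumption can be combined, the following is a minimal sketch, not the thesis's implementation. It assumes that each syllable already has a label distribution (such as a classification tree leaf might supply) and that these are used as per-position scores alongside a first-order transition matrix. The function name, the two-label setup, and all numbers are hypothetical.

```python
import math


def viterbi(posteriors, transitions, initial):
    """Return the most likely label sequence under a first-order Markov assumption.

    posteriors[t][k]  : score of label k at position t (assumed to come from a tree)
    transitions[j][k] : probability of label k following label j
    initial[k]        : probability of label k at the first position
    """
    n_labels = len(initial)

    def log(p):
        # Work in log space so long sequences do not underflow.
        return math.log(p) if p > 0 else float("-inf")

    score = [log(initial[k]) + log(posteriors[0][k]) for k in range(n_labels)]
    back = []  # back[t-1][k] = best predecessor of label k at position t
    for t in range(1, len(posteriors)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + log(transitions[j][k]))
            new_score.append(score[best_j] + log(transitions[best_j][k])
                             + log(posteriors[t][k]))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    k = max(range(n_labels), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


# Hypothetical example: label 0 = unaccented, label 1 = accented.
posteriors = [[0.6, 0.4], [0.45, 0.55], [0.7, 0.3]]
transitions = [[0.7, 0.3], [0.8, 0.2]]  # adjacent accents assumed unlikely
initial = [0.5, 0.5]
path = viterbi(posteriors, transitions, initial)
```

In this toy example, choosing the best label for each syllable independently would mark the second syllable as accented, while the sequence model prefers leaving all three unaccented. This is the kind of contextual effect a Markov assumption is meant to capture.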
The model for F0 and energy generation is a unique approach that
incorporates traditional methods of F0 generation into a model whose
parameters are estimated automatically from labeled speech. F0 and
energy are generated with a state-space dynamical system model that
assumes there is an unobserved state vector corresponding to the noisy
observations of F0 and energy. Parameters are specified to capture
segment, syllable, and phrase level effects. Since there is
unobserved data, parameters are estimated using a non-traditional
method based upon the EM algorithm. These two models are evaluated,
independently and together, in quantitative and perceptual tests that
demonstrate improvements in the quality of text-to-speech synthesis.
These models are also demonstrated to be useful in prosody recognition
applications.
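To make the state-space idea concrete, the following is a minimal scalar sketch of a linear dynamical system with a hidden state and noisy observations, together with a Kalman filter that estimates the state. It is an assumption-laden illustration, not the model from the thesis: the scalar form, the parameter names (a, b, c, q, r), and the numbers are all hypothetical. In EM-based training of such models, state estimates of this kind are typically what the expectation step supplies.

```python
def kalman_filter(observations, a, b, c, q, r, mean0, var0):
    """Filter a scalar linear state-space model (hypothetical parameterization).

    State:        x[t+1] = a * x[t] + b + w,  w ~ N(0, q)
    Observation:  y[t]   = c * x[t] + v,      v ~ N(0, r)
    Returns the filtered state means, one per observation.
    """
    mean, var = mean0, var0
    filtered = []
    for y in observations:
        # Update: correct the predicted state using the noisy observation.
        gain = var * c / (c * c * var + r)
        mean = mean + gain * (y - c * mean)
        var = (1.0 - gain * c) * var
        filtered.append(mean)
        # Predict: propagate the state estimate to the next time step.
        mean = a * mean + b
        var = a * a * var + q
    return filtered


# Hypothetical example: a constant hidden level observed three times.
means = kalman_filter([5.0, 5.0, 5.0], a=1.0, b=0.0, c=1.0,
                      q=0.0, r=1.0, mean0=0.0, var0=1.0)
```

With each observation the estimate moves from the prior mean toward the observed value, by an amount that depends on the assumed observation noise.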
The full thesis is available in PostScript format (1.37 MB).