Computational Modeling of Intonation for Synthesis and Recognition

Higher-quality text-to-speech synthesis is needed for a range of applications, including voice response for telephone-based information access as well as more general human-machine communication. It is generally agreed that prosody - the phrase and accent structure of speech that provides information about sentence meaning - is one of the most critical aspects of synthesis technology to improve, and intonation is an important component of prosody. The goal of this research is therefore to develop a computational model of intonation and to provide algorithms for generating prosodic controls that can be integrated into a text-to-speech synthesis system to obtain higher-quality synthetic speech. A secondary goal is to develop models that can easily be customized to different voices and task domains, so the focus is on automatically trainable models that can be used for both recognition (to label data from a new speaker) and synthesis.

The first stage of the project focused on models for generating intonation patterns from text, including phrasal prominence and tune patterns at the abstract level and F_0 and energy contours at the acoustic level. Our strategy was to combine recent developments in linguistic theory and prosodic transcription with statistical signal processing techniques that allow automatic estimation of model parameters. The project resulted in algorithms for (1) predicting prominence placement from text, and (2) generating F_0 and energy contours from abstract phonological labels. The algorithms can be easily incorporated into existing synthesis systems, and they received good perceptual ratings from listeners when incorporated into the AT&T TTS system. In addition, preliminary results suggest that the F_0 generation model works well for recognition of prosodic labels. Current efforts focus on extending and improving the recognition results.
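As a rough illustration of step (2), the sketch below generates a frame-level F_0 contour from a sequence of abstract labels using a scalar linear dynamical system. It is a minimal sketch under stated assumptions, not the project's model: the label set, the parameter values, and the names `LABEL_PARAMS` and `generate_f0` are all hypothetical, and a trainable model would estimate such parameters from labeled speech rather than fix them by hand.

```python
import random

# Hypothetical per-label parameters (illustrative values, not from the project).
# Each label selects a target F0 in Hz and a rate at which the state approaches it.
LABEL_PARAMS = {
    "unaccented": {"target": 110.0, "rate": 0.80},
    "H*": {"target": 160.0, "rate": 0.70},
    "L*": {"target": 90.0, "rate": 0.70},
}


def generate_f0(segments, x0=110.0, noise_std=0.0, seed=0):
    """Generate an F0 contour from a list of (label, n_frames) segments.

    Per-frame state update: x[t+1] = a * x[t] + (1 - a) * target + w[t],
    where a and target depend on the current label and w[t] is Gaussian
    noise with standard deviation noise_std (no noise when noise_std is 0).
    """
    rng = random.Random(seed)
    x = x0
    contour = []
    for label, n_frames in segments:
        params = LABEL_PARAMS[label]
        a, target = params["rate"], params["target"]
        for _ in range(n_frames):
            x = a * x + (1.0 - a) * target
            if noise_std > 0.0:
                x += rng.gauss(0.0, noise_std)
            contour.append(x)
    return contour


# Example: a high pitch accent between two unaccented stretches.
contour = generate_f0([("unaccented", 10), ("H*", 15), ("unaccented", 10)])
```

Because the same parameters define a probability model over contours when noise is included, a model of this general form could in principle also score candidate label sequences against observed F_0, which is one way a single model can serve both synthesis and recognition.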

NYNEX, June 1993 - March 1995
NSF, April 1995 - June 1995
Entropic Research Lab, Jan 1996 - Dec 1996


(supported all or in part by this grant)

"A Dynamical System Model for Generating F_0 for Synthesis," K. Ross and M. Ostendorf, Proceedings of the ESCA/IEEE Workshop on Speech Synthesis, pp. 131-134, Sept. 1994.

"A Dynamical System Model for Recognizing Intonation Patterns," K. Ross and M. Ostendorf, Proc. Eurospeech, Sept. 1995.

"A Dynamical System Model for Generating Fundamental Frequency for Speech Synthesis," K. Ross and M. Ostendorf, submitted manuscript.

"Prediction of Abstract Prosodic Labels for Speech Synthesis," K. Ross and M. Ostendorf, submitted manuscript.
