Duration Modeling for Speech Synthesis and Recognition

Cynthia Fong

This thesis describes research on duration modeling for both speech synthesis and recognition applications. In the past, segmental duration models were developed primarily for synthesis systems. While these models strove to incorporate durationally relevant information, the cues were mostly syntactic in nature. In comparison, models used in current speech recognition systems are very simple and do not take advantage of knowledge gained from synthesis work. In addition, the durational probabilities have almost no weight in the recognition score, given the high dimensionality of the acoustic feature vectors typically used. The models presented here advance and build upon previous work by incorporating prosodic factors directly into the synthesis model, and by leveraging common techniques in both synthesis and recognition models.

Although the synthesis and recognition applications are different, the sources of variability are the same in both cases and so a common modeling approach is possible. In both applications, our duration model uses two automatic learning methods: decision trees and divisive clustering. By using decision trees as part of the modeling procedure, we avoid having to make any assumptions concerning the relationships between different factors affecting duration. The models also make use of factors known to influence duration, including prosodic features which have been shown to be important in our corpus analyses. The synthesis model has two types of automatically trained parameters, one based on prosodic classes and the other generated by decision trees. Modeling duration for speech recognition involves estimating parameters for the assumed Gamma distributions of different durational classes, using binary clustering with a maximum likelihood criterion to determine conditioning factors. Durational probabilities are separately weighted in recognition scoring to take full advantage of this information.
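The recognition-side ideas in this paragraph (a Gamma distribution per durational class, a likelihood-based test for a binary split, and a separately weighted duration term in the score) can be illustrated with a minimal sketch. It is not the thesis implementation. All function names, the example durations, and the weight are hypothetical, and for brevity the Gamma parameters are estimated by the method of moments rather than by maximum likelihood; only the split comparison uses a likelihood criterion.

```python
import math

def fit_gamma_moments(durations):
    """Estimate Gamma shape and scale by matching the sample mean and variance.
    Assumption: method of moments stands in for maximum likelihood estimation.
    Requires at least two distinct positive durations (variance must be > 0)."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    return mean * mean / var, var / mean  # (shape, scale)

def gamma_log_pdf(d, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at duration d > 0."""
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_likelihood(durations, params):
    """Total log-likelihood of a set of durations under one Gamma model."""
    shape, scale = params
    return sum(gamma_log_pdf(d, shape, scale) for d in durations)

def split_gain(group_a, group_b):
    """Log-likelihood gain from modeling two groups separately instead of pooled.
    A binary clustering step would prefer the candidate factor with the largest gain."""
    pooled = group_a + group_b
    together = log_likelihood(pooled, fit_gamma_moments(pooled))
    apart = (log_likelihood(group_a, fit_gamma_moments(group_a))
             + log_likelihood(group_b, fit_gamma_moments(group_b)))
    return apart - together

def combined_score(acoustic_log_prob, duration_log_prob, duration_weight):
    """Add a separately weighted duration term to an acoustic log score."""
    return acoustic_log_prob + duration_weight * duration_log_prob

# Hypothetical segment durations in seconds for two values of some binary factor.
short_context = [0.04, 0.05, 0.06, 0.05]
long_context = [0.14, 0.15, 0.16, 0.15]
gain = split_gain(short_context, long_context)
```

A positive `gain` means that conditioning on the factor explains the durations better than a single pooled distribution does. The explicit `duration_weight` addresses the issue raised in the first paragraph: without it, a single duration log probability contributes very little next to the score from high-dimensional acoustic feature vectors.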

Contributions of this thesis include: corpus analysis results that provide a better understanding of duration as a cue to prosodic patterns, a new parametric approach to duration modeling for synthesis, and an improved duration model for recognition that leads to an 8% reduction in word error rate.
