Although the synthesis and recognition applications are different, the sources of variability are the same in both cases, so a common modeling approach is possible. In both applications, our duration model relies on two automatic learning methods: decision trees and divisive clustering. By using decision trees as part of the modeling procedure, we avoid having to make any assumptions concerning the relationships between the different factors affecting duration. The models also make use of factors known to influence duration, including prosodic features that our corpus analyses have shown to be important. The synthesis model has two types of automatically trained parameters, one based on prosodic classes and the other generated by decision trees. Modeling duration for speech recognition involves estimating parameters for the assumed Gamma distributions of different durational classes, using binary clustering with a maximum likelihood criterion to determine the conditioning factors. Durational probabilities are weighted separately in recognition scoring to take full advantage of this information.
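As an illustration of the recognition-side idea, the following is a minimal sketch, not the implementation used in this thesis, of how one candidate binary split might be scored with a likelihood criterion under Gamma-distributed durations. The function names, the context dictionaries, and the sample durations are hypothetical, and the shape parameter is estimated with a standard closed-form approximation to the maximum likelihood estimate rather than an exact iterative fit.

```python
import math

def fit_gamma(durations):
    """Approximate ML fit of Gamma(shape, scale) to positive durations.

    Uses a closed-form approximation to the ML shape estimate (an
    assumption of this sketch); the scale then follows from the mean.
    """
    n = len(durations)
    mean = sum(durations) / n
    s = math.log(mean) - sum(math.log(d) for d in durations) / n
    if s <= 0:
        raise ValueError("durations must not all be identical")
    shape = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return shape, mean / shape

def log_likelihood(durations, shape, scale):
    """Total Gamma log-likelihood of the durations."""
    return sum(
        (shape - 1) * math.log(d) - d / scale
        - math.lgamma(shape) - shape * math.log(scale)
        for d in durations
    )

def split_gain(samples, question):
    """Log-likelihood gain from splitting samples by a yes/no question.

    samples: list of (duration, context) pairs (hypothetical format).
    question: function mapping a context to True or False.
    """
    yes = [d for d, c in samples if question(c)]
    no = [d for d, c in samples if not question(c)]
    if len(yes) < 2 or len(no) < 2:
        return 0.0  # too little data on one side to fit a distribution
    pooled = [d for d, _ in samples]
    return (log_likelihood(yes, *fit_gamma(yes))
            + log_likelihood(no, *fit_gamma(no))
            - log_likelihood(pooled, *fit_gamma(pooled)))

# Made-up vowel durations in seconds, each with a hypothetical context.
samples = [
    (0.120, {"stressed": True, "final": True}),
    (0.140, {"stressed": True, "final": False}),
    (0.130, {"stressed": True, "final": True}),
    (0.150, {"stressed": True, "final": False}),
    (0.050, {"stressed": False, "final": True}),
    (0.060, {"stressed": False, "final": False}),
    (0.055, {"stressed": False, "final": True}),
    (0.070, {"stressed": False, "final": False}),
]

gain_stress = split_gain(samples, lambda c: c["stressed"])
gain_final = split_gain(samples, lambda c: c["final"])
```

In a divisive clustering loop of this kind, the question with the largest gain would be chosen at each node and the procedure repeated on the two resulting classes until the gain or the class size falls below a threshold.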
Contributions of this thesis include corpus analysis results that provide a better understanding of duration as a cue to prosodic patterns, a new parametric approach to duration modeling for synthesis, and an improved duration model for recognition that leads to an 8% reduction in word error rate.