Structural Alternatives in Stochastic Segment Modeling for Speech Recognition

Ibrahim M. Bechwati

The Stochastic Segment Model is a new approach used in continuous speech recognition systems. It is a fixed-length stochastic model used to represent variable-length speech segments. The segment model is based on the assumption that speech segments have an unobserved trajectory in k-dimensional feature space, which is modeled as a sequence of m k-dimensional feature vectors. There is a trade-off between the improved recognition performance associated with more complex models and the performance degradation that results when a limited amount of training data is insufficient for estimating a large number of parameters. The length of the sequence (m), the number of features (k), the number of models per phone, and other assumptions determine the complexity and the robustness of the model for a given set of training data.
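As an illustration of how a variable-length segment can be related to a fixed-length sequence of m feature vectors, the following is a minimal sketch that assumes simple linear time sampling. The abstract does not state which time-warping scheme the thesis uses, so the function name `resample_segment` and the sampling rule are assumptions made only for this example.

```python
from typing import List, Sequence


def resample_segment(frames: Sequence[Sequence[float]], m: int) -> List[List[float]]:
    """Map a segment of n k-dimensional frames to exactly m frames.

    Assumption for illustration: linear time sampling, where output frame i
    copies the input frame at the same relative position in time.
    """
    n = len(frames)
    if n == 0 or m <= 0:
        raise ValueError("need at least one input frame and m >= 1")
    # Centre of output slot i is (i + 0.5) / m; scale it to an input index.
    return [list(frames[(2 * i + 1) * n // (2 * m)]) for i in range(m)]


# A short segment (n = 3 frames, k = 2 features) stretched to m = 5 frames.
short_segment = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
fixed_length = resample_segment(short_segment, 5)
print(len(fixed_length), "frames of dimension", len(fixed_length[0]))
```

Segments longer than m are handled by the same rule, which then skips input frames instead of repeating them.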

The goal of this work was to improve recognition performance by investigating different model structures to achieve a balance between complexity and robust parameter estimates. To that end, we investigated: (1) robust parameter estimation for different numbers of parameters in context-independent models; (2) conditional models that depend on speaker sex and on phone length; and (3) context-dependent models (conditioning on left and right phoneme context) using a thresholding design.

By increasing the number of model parameters (by increasing k or m, or by using block-diagonal rather than diagonal covariance structures), we determine the effects of limited training data on model robustness. Designing sex-dependent, length-dependent, and context-dependent models based on the size of the training data increases the complexity of the stochastic segment models. In all of these conditional models, except the length-dependent models, where no significant improvement was obtained, the additional information was important enough to compensate for the loss of robustness caused by increasing the number of parameters. The sex-dependent models reduced the number of errors of the baseline system by 21%, and the context-dependent models reduced it by almost 20%.
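To make the growth in parameters concrete, the sketch below counts the free parameters of a single Gaussian segment model for different covariance structures. It assumes that "block-diagonal" means one full k-by-k covariance block per frame, with frames uncorrelated with each other; the abstract does not define the block layout, and the values m = 5 and k = 14 are hypothetical.

```python
def gaussian_param_count(m: int, k: int, structure: str) -> int:
    """Count free parameters (mean + covariance) of a Gaussian over m*k values.

    structure:
      "diagonal"        one variance per feature per frame
      "block_diagonal"  assumed here: one symmetric k x k block per frame
      "full"            one symmetric (m*k) x (m*k) matrix
    """
    d = m * k
    mean_params = d
    if structure == "diagonal":
        cov_params = d
    elif structure == "block_diagonal":
        cov_params = m * k * (k + 1) // 2
    elif structure == "full":
        cov_params = d * (d + 1) // 2
    else:
        raise ValueError(f"unknown covariance structure: {structure}")
    return mean_params + cov_params


# Hypothetical model size, chosen only for illustration.
for structure in ("diagonal", "block_diagonal", "full"):
    print(structure, gaussian_param_count(5, 14, structure))
```

Under these assumptions the block-diagonal model has roughly four times as many parameters per model as the diagonal one, and splitting the training data across sex-dependent or context-dependent models leaves each model with fewer tokens from which to estimate them.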
