The goal of this work was to improve recognition performance by investigating model structures that balance complexity against robust parameter estimation. To that end we investigated: (1) robust parameter estimation for different numbers of parameters in context-independent models; (2) speaker-sex- and phone-length-dependent conditional models; and (3) context-dependent models (conditioned on the left and right phoneme context) designed with a thresholding scheme.
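The thresholding design for context-dependent models can be illustrated with a simple sketch: a context-dependent model is trained for a phone-in-context only when enough training tokens are available, and otherwise the system backs off to the context-independent model. The function name, token format, and threshold value below are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def build_context_models(tokens, threshold):
    """For each (left, phone, right) triphone context, keep a
    context-dependent model only when at least `threshold` training
    tokens exist; otherwise back off to the context-independent
    phone model. (Illustrative sketch; details are assumptions.)"""
    counts = Counter(tokens)
    models = {}
    for (left, phone, right), n in counts.items():
        if n >= threshold:
            models[(left, phone, right)] = f"CD:{left}-{phone}+{right}"
        else:
            models[(left, phone, right)] = f"CI:{phone}"
    return models

# A frequent context gets its own model; a rare one backs off.
tokens = [("sil", "ae", "t")] * 30 + [("k", "ae", "t")] * 3
models = build_context_models(tokens, threshold=10)
```

This trades modeling power for robustness: only contexts with adequate training data pay the cost of the extra parameters.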
By increasing the number of model parameters (by increasing k or m, or by using block-diagonal rather than diagonal covariance structures) we determine the effects of limited training data on model robustness. Designing sex-dependent, length-dependent, and context-dependent models based on the amount of available training data increases the complexity of the stochastic segment models. In all of these conditional models, except the length-dependent models where no significant improvement was obtained, the additional information was important enough to compensate for the loss of robustness caused by the increased number of parameters. The sex-dependent models reduced the error rate of the baseline system by 21%, and the context-dependent models reduced it by almost 20%.
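The growth in parameters with m, k, and the covariance structure can be made concrete with a rough count. The sketch below assumes a segment model with m region distributions, each a k-dimensional Gaussian; the function and the block-size parameter are illustrative assumptions, not the paper's definitions.

```python
def ssm_param_count(m, k, cov="diagonal", block=1):
    """Rough parameter count for a segment model with m regions,
    each a k-dimensional Gaussian (means plus covariance terms).
    cov: 'diagonal' -> k variance terms per region;
         'block'    -> block-diagonal with blocks of size `block`,
                       block*(block+1)/2 terms per block;
         'full'     -> k*(k+1)/2 covariance terms per region.
    (Illustrative sketch; formulas are assumptions.)"""
    means = m * k
    if cov == "diagonal":
        covs = m * k
    elif cov == "block":
        assert k % block == 0, "k must be divisible by the block size"
        covs = m * (k // block) * block * (block + 1) // 2
    else:  # full covariance
        covs = m * k * (k + 1) // 2
    return means + covs

# e.g. m=8 regions, k=14 features:
# diagonal:              8*14 + 8*14        = 224
# block-diagonal (b=2):  8*14 + 8*7*3       = 280
# full:                  8*14 + 8*14*15//2  = 952
```

The count shows why richer covariance structures quickly strain limited training data: full covariance needs over four times the parameters of the diagonal case here.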