## Structural Alternatives in Stochastic Segment Modeling for Speech Recognition

### Ibrahim M. Bechwati

The Stochastic Segment Model is a new approach used in continuous speech
recognition systems. It is a fixed length stochastic model used to
represent variable length speech segments. The segment model is based on
the assumption that speech segments have an unobserved trajectory in
*k*-dimensional feature space, which is modeled as a sequence of
*m* *k*-dimensional feature vectors. There is a trade-off between the
improved recognition performance associated with more complex models and
the performance degradation caused by estimating large numbers of
parameters from insufficient training data. The length of the sequence
(*m*), the number of features (*k*), the number of models per phone, and
other modeling assumptions determine the complexity and the robustness of
the model for a given amount of training data.
The goal of this work was to improve recognition performance by exploring
model structures that balance complexity against robust parameter
estimation. To that end, we investigated:
(1) robust parameter estimation for different numbers of parameters in
context-independent models; (2) speaker-sex and phone length-dependent
conditional models; and (3) context-dependent models (conditioning on left
and right phoneme context) using thresholding design.
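
The fixed-length representation described above can be sketched as follows. This is a minimal illustration, assuming linear time warping of a variable-length segment to *m* frames and an independent diagonal-Gaussian model per frame position; the function names and the interpolation scheme are illustrative, not taken from the thesis.

```python
import numpy as np

def resample_segment(segment, m):
    """Map a variable-length segment (n x k) onto the fixed-length
    model representation (m x k) by linear time warping."""
    n, k = segment.shape
    # Fractional positions in the original segment for each model frame.
    idx = np.linspace(0, n - 1, m)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None]
    return (1 - frac) * segment[lo] + frac * segment[hi]

def segment_log_likelihood(segment, means, variances):
    """Score a segment under m independent diagonal-Gaussian frame
    distributions (means and variances are both m x k arrays)."""
    m, k = means.shape
    x = resample_segment(segment, m)
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances)
```

During recognition, each candidate phone segment would be warped to the model length and scored against every phone model; the diagonal variances here correspond to the simplest covariance structure considered in the experiments.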

By increasing the number of model parameters (by increasing *k*, or
*m*, or by using block-diagonal versus diagonal covariance structures)
we determine the effects of limited training data on the model robustness.
Conditioning the models on speaker sex, phone length, or phonemic context
increases the complexity of the stochastic segment models relative to the
available training data. In all of these conditional
models, except length-dependent models where no significant improvement is
made, the additional information was important enough to compensate for the
loss of robustness caused by increasing the number of parameters. The
sex-dependent models reduced the number of errors of the baseline system by
21%, and the context-dependent models reduced it by almost 20%.
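
To make the complexity trade-off concrete, the per-frame covariance parameter count under each structure mentioned above can be computed directly. This is a sketch; the example dimensions (*k* = 14, blocks of 2) are illustrative assumptions, not values reported in the thesis.

```python
def covariance_params(k, structure, block=None):
    """Free covariance parameters per model frame for a k-dimensional
    feature vector under a given covariance structure."""
    if structure == "diagonal":
        return k                          # one variance per feature
    if structure == "block-diagonal":
        # k // block blocks, each a symmetric (block x block) matrix
        assert block is not None and k % block == 0
        return (k // block) * block * (block + 1) // 2
    if structure == "full":
        return k * (k + 1) // 2           # full symmetric covariance
    raise ValueError(f"unknown structure: {structure}")

# Illustrative comparison for k = 14 features:
# diagonal -> 14, blocks of 2 -> 21, full -> 105 parameters per frame,
# each multiplied by m frames per segment model.
```

The rapid growth from diagonal to full covariance shows why richer structures demand more training data for robust estimation.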
