Modeling Structure in Speech above the Segment for Spontaneous Speech Recognition
Current speech recognition technology, while useful in tightly
constrained domains with cooperative speakers, still leads to
unacceptably high error rates (30-50%) on unconstrained speech such as
the conversational speech in the Switchboard corpus and the radio news
broadcasts in the DARPA Hub4 task. An important difference between
these tasks and the controlled speech tasks that give 90% accuracy or
better is the much larger variability in speaking style even within
data from a single speaker. Since they do not account for the
systematic factors behind this variability, current acoustic models
must be ``broader,'' leading to more confusability among words and
hence high error rates. To address this problem, this work proposes
to improve acoustic models by representing structure in speech above
the level of the phonetic segment.
Specifically, the work involves modeling structure at three time
scales: the syllable, short regions or sequences of words within an
utterance, and the session or conversation (i.e. speaker). At the
syllable level, we are developing automatic clustering techniques for
high-dimensional context vectors in order to capture the effects of
syllable structure and longer contexts, as well as a parametric model
of temporal variability. A goal is to move from phone to syllable-size
units to facilitate modeling reduction phenomena where phone segments
are dropped but the associated gestures are still apparent in the
coarticulation effects on neighboring phones. At the region level,
the project involves incorporating a slowly varying hidden speaking
mode that is an indicator of systematic differences in pronunciations
associated with reduced vs. clearly articulated speech. The hidden
speaking mode is cued by acoustic features, such as speaking rate,
relative energy and relative pitch range, as well as
conversation-level word cues to information status. Currently, we are
investigating the dependence of word-based and acoustic variation on
dialog acts, such as statements, questions and back-channel
acknowledgements. At the speaker level, the effort involves using
hierarchical models of the correlation among speech sounds
(e.g. between an ``m'' and an ``n'' for the same speaker) to improve
adaptation of acoustic models. Such dependence models address a
problem seen in many applications: the acoustic space is sparsely
sampled because only a small amount of adaptation data is available. The
three efforts have in common the general themes of developing new
stochastic models that capture high-level structure, and using
automatic learning to help define both the structure and appropriate
regions of parameter tying for estimation. Automatic learning
includes linguistically motivated feature extraction so the algorithms
benefit as much as possible from existing knowledge. All three
efforts build on a common framework of non-stationary segment
trajectory modeling, and the interactions between the different time
scales will be leveraged in an integration of the three components in
the final stage of the project.
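
The speaker-level idea above, using the correlation between sounds to adapt a sound for which no adaptation data was observed, can be illustrated with a minimal sketch. This is not the project's hierarchical model: it assumes a zero-mean bivariate Gaussian over scalar per-sound mean shifts, and every name and number in it (predict_unseen_shift, the 0.8 correlation, the toy means) is hypothetical.

```python
def predict_unseen_shift(observed_shift, corr, std_observed, std_unseen):
    """Expected shift of an unseen sound given the shift of an observed one.

    Assumption: the two shifts are jointly Gaussian with zero mean, so the
    conditional mean is corr * (std_unseen / std_observed) * observed_shift.
    """
    return corr * (std_unseen / std_observed) * observed_shift


# Hypothetical speaker-independent means for two nasals (scalar features).
si_means = {"m": 1.0, "n": 2.0}

# Suppose adaptation data covered only "m", whose mean moved by +0.5.
shift_m = 0.5

# Assumed correlation and spreads of the speaker-dependent shifts.
shift_n = predict_unseen_shift(shift_m, corr=0.8, std_observed=1.0, std_unseen=1.0)

adapted_means = {
    "m": si_means["m"] + shift_m,  # adapted directly from data
    "n": si_means["n"] + shift_n,  # adapted through the correlation
}
```

With a correlation of zero the unseen sound keeps its speaker-independent mean, so the dependence model only helps to the extent that the sounds really do vary together across speakers.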
Experiments involve large vocabulary recognition of conversational speech.
The effort relies on a multi-pass recognition search strategy,
which reduces the search space with standard hidden Markov models in
order to allow rescoring with the higher-order (and therefore more
computationally expensive) models developed here. Thus, the work
builds on strengths of existing speech recognition technology, while
exploring radically new knowledge sources (i.e. long-term dependence),
a combination that has the potential to significantly advance the
state of the art in speech recognition performance.
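
The multi-pass strategy described above can be sketched as rescoring a short list of first-pass hypotheses with a more expensive model. The sketch below is only an illustration under stated assumptions: the first pass is assumed to return an N-best list of word sequences with log scores, the scores are combined by a weighted sum, and the function names, weight, and toy data are all hypothetical rather than taken from the project.

```python
from typing import Callable, List, Tuple

# One first-pass hypothesis: (word sequence, first-pass log score).
Hypothesis = Tuple[List[str], float]


def rescore_nbest(
    nbest: List[Hypothesis],
    second_pass_score: Callable[[List[str]], float],
    weight: float = 1.0,
) -> List[Hypothesis]:
    """Re-rank first-pass hypotheses with a costlier second-pass model.

    Combined score = first-pass score + weight * second-pass score,
    both assumed to be in the log domain.
    """
    rescored = [
        (words, first + weight * second_pass_score(words))
        for words, first in nbest
    ]
    # Highest combined log score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical toy N-best list; the scores are made up for illustration.
nbest = [
    (["i", "want", "to", "go"], -10.0),
    (["i", "wanna", "go"], -10.5),
    (["eye", "want", "to", "go"], -12.0),
]


def toy_second_pass(words: List[str]) -> float:
    # Stand-in for an expensive model; here it simply favors "wanna".
    return 1.0 if "wanna" in words else 0.0


best_words, best_score = rescore_nbest(nbest, toy_second_pass)[0]
```

The expensive model is evaluated only on the few hypotheses that survive the first pass, which is what keeps the higher-order models affordable.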
(April 1997 -- March 2000)
SPONSOR: NSF IRI-9618926
AWARD PERIOD: April 1997 - March 2001
TEAM MEMBERS:
PUBLICATIONS:
-
``Automatic Detection of Sentence Boundaries and Disfluencies based on
Recognized Words,'' A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf,
D. Hakkani, M. Plauche, G. Tur and Y. Lu, Proceedings of the
International Conference on Spoken Language Processing, 1998,
vol.5, pp. 2247-2250.
-
``Moving beyond the `beads-on-a-string' model of speech,'' M. Ostendorf,
Proc. IEEE ASRU Workshop, 1999.
-
``Use of higher level linguistic structure in acoustic modeling
for speech recognition,'' I. Shafran and M. Ostendorf, Proceedings of the
International Conference on Acoustics, Speech and Signal Processing,
vol. III, pp. 1643-1646, 2000.
-
``Incorporating linguistic theories of phonological variation into
speech recognition models,'' M. Ostendorf, Phil. Trans. Royal Society,
vol. 358, no. 1769, pp. 1325-1338, 2000.
-
``Integrating Articulatory Features into Acoustic Models for Speech
Recognition,'' K. Kirchhoff, Proceedings Workshop PhonASR,
Saarbruecken, Germany, April 2000.
-
``Speech Analysis by Rule Extraction from Trained Artificial Neural
Networks,'' K. Kirchhoff, Proceedings of International Conference on
Spoken Language Processing, Beijing, October 2000.
-
``Clustering wide-contexts and HMM topologies for spontaneous speech
recognition,'' Izhak Shafran, Ph.D. Thesis, 2001.
-
``A prosodically labeled database of spontaneous speech,''
M. Ostendorf, I. Shafran, S. Shattuck-Hufnagel, B. Byrne and
L. Carmichael, Proc. of the ISCA Workshop on Prosody in Speech
Recognition and Understanding, pp. 119-121, October 2001.
-
``Prosody and phonetic variability: lessons learned from acoustic
model clustering,'' I. Shafran, M. Ostendorf, and R. Wright,
Proc. of the ISCA Workshop on Prosody in Speech Recognition and
Understanding, pp. 127-131, October 2001.
-
``Reducing the Effects of Pronunciation Variability on Spontaneous
Speech Recognition using Prosody and Discourse,'' R. Bates and M. Ostendorf,
Proc. of the ISCA Workshop on Prosody in Speech Recognition and
Understanding, pp. 17-22, October 2001.
-
I. Shafran and M. Ostendorf, ``Acoustic Model Clustering Based on
Syllable Structure,'' submitted manuscript.