Evaluating the Use of Prosodic Information in Speech
Recognition and Understanding
Prosody marks information structure in speech via phrasing, relative
prominence and the tones that mark them. Our goal was to develop
algorithms for automatically detecting these cues and for using them
to improve speech understanding accuracy. Associated with this goal
are a host of technical challenges, ranging from finding the mapping
from prosody to meaning, to modeling the multiple levels of interaction
in prosody, to distinguishing fluent prosodic phrase endings from
disfluent pauses. Our approach was multi-disciplinary, combining
linguistic theory, speech knowledge and statistical modeling
techniques. The research involved: 1) determining a representation of
prosodic information suitable for use in speech understanding systems,
2) conducting distributional and acoustic analyses of speech corpora
to better understand prosodic phenomena and define the structure of
the computational models, 3) developing reliable algorithms for
detection of the prosodic markers in speech, 4) investigating
architectures for integrating prosodic cues in speech understanding
systems, and 5) assessing potential performance improvements by
evaluating prosody algorithms in an actual spoken language system
(SLS). An important aspect of the approach was post-recognition
prosody processing that provides better duration cues and allows for
conditioning on segmental effects. Prosody is used as a supplementary
knowledge source, providing information not available from the words
alone for evaluating language interpretation hypotheses.
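As a rough illustration of prosody acting as a supplementary knowledge source after recognition, the sketch below re-ranks competing interpretation hypotheses by adding a weighted prosody score to the recognizer score. It is a minimal sketch under stated assumptions, not the project's implementation: the names `Hypothesis`, `rescore` and `prosody_weight`, the log-linear combination, and all scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One candidate interpretation (hypothetical structure for illustration)."""
    words: List[str]
    recognizer_logprob: float  # score from the words alone (acoustic + language model)
    prosody_logprob: float     # assumed score: fit between detected prosodic cues and this hypothesis


def rescore(hyps: List[Hypothesis], prosody_weight: float = 0.5) -> List[Hypothesis]:
    """Return hypotheses sorted best-first by a weighted sum of the two log scores.

    The log-linear combination is an assumption made for this sketch.
    """
    def combined(h: Hypothesis) -> float:
        return h.recognizer_logprob + prosody_weight * h.prosody_logprob

    return sorted(hyps, key=combined, reverse=True)
```

With `prosody_weight=0.0` the ranking depends on the word-based score alone; raising the weight lets prosodic evidence overturn a close decision between hypotheses.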
The project investigated three different aspects of prosody: the
marking of prominent syllables and phrase boundaries and the
relationship of these cues to syntactic structure; the association of
prosodic features with disfluencies in spontaneous speech; and the use
of prosody as a cue to higher level dialog structure.
Specific contributions of this project include:
- Transcription systems: Developed and documented a system for prosodic
transcription and a system for disfluency transcription -- both of
which have influenced other transcription efforts -- and used these
systems to label various corpora.
- Prosodic phrases and prominences: Analyzed the relationship
between symbolic prosodic events (phrases and prominences) and
syntactic structure and the acoustic cues
to these events; developed algorithms
for detecting such prosodic events
and architectures for using them to improve parsing accuracy and/or
speed; and developed a model of duration
for use in speech recognition that combines prosodic and phonetic
- Disfluencies: Conducted distributional analyses to
determine important classes of disfluencies;
determined acoustic cues to some of these classes; developed algorithms
for detecting disfluencies from acoustic and textual cues;
and investigated mechanisms for
accounting for the presence of disfluencies in language modeling for
- High level structure: Studied acoustic and textual cues
to discourse structure in human-computer dialogs;
and analyzed the acoustic cues to speaking ``style'' with the goal
of systematically modeling regions of phonetic reduction in pronunciation.
(August 1989 -- December 1996)
SPONSORS: National Science Foundation and ARPA, NSF IRI-8905249/IRI-9248730
The following recent publications and presentations were supported
at least in part by this grant. Other publications supported by
this grant are listed on the
A. Stolcke and E. Shriberg,
``Statistical language modeling for speech
disfluencies,'' Proc. ICASSP, I:405-408, 1996.
P. Price and M. Ostendorf, ``Combining Linguistic with
Statistical Methods in Modeling Prosody,'' in Signal to syntax:
Bootstrapping from speech to grammar in early acquisition,
J. L. Morgan and K. Demuth (Eds.), pp. 67-83, Hillsdale, NJ: Lawrence
Erlbaum Associates, 1996.
E. Shriberg and A. Stolcke,
``Word predictability after hesitations: A corpus-based study,''
Proc. ICSLP, 1996.
E. Shriberg, D. R. Ladd, and J. Terken, ``Modeling intra-speaker pitch
range variation: Predicting F0 targets when `speaking up','' Proc.
ICSLP, 1996.
M.H. Siu, M. Ostendorf, and H. Gish, ``Modeling Disfluencies in
Conversational Speech,'' Proc. ICSLP, 1996.
L. Dilley, S. Shattuck-Hufnagel and M. Ostendorf, ``Glottalization of
Vowel-Initial Syllables as a Function of Prosodic Structure,''
J. Phonetics, 24:423-444, 1996.
M. Ostendorf and K. Ross, ``A Multi-Level Model for Recognition of Intonation
Labels,'' in Computing Prosody, Y. Sagisaka,
N. Campbell and N. Higuchi (Eds.), 291-308, Springer-Verlag, NY: 1997.
M. Swerts and M. Ostendorf, ``Prosodic Indications of Discourse Structure
in Human-Machine Interactions,'' Speech Communication, v. 22, 1997.
M. Ostendorf, ``Linking Speech Recognition and Language Processing
Through Prosody,'' CC-AI, to appear.
E. E. Shriberg, R.A. Bates and A. Stolcke, ``A prosody-only decision-tree model
for disfluency detection,'' Proc. EUROSPEECH, 1997.
M. Siu and M. Ostendorf, ``Variable N-gram Language Modeling and
Extensions for Conversational Speech,'' Proc. EUROSPEECH, 1997.