Evaluating the Use of Prosodic Information in Speech Recognition and Understanding

Prosody marks information structure in speech via phrasing, relative prominence, and the tones that mark them. Our goal was to develop algorithms for automatically detecting these cues and for using them to improve speech understanding accuracy. Associated with this goal is a host of technical challenges, ranging from finding the mapping from prosody to meaning, to modeling the multiple levels of interaction in prosody, to distinguishing fluent prosodic phrase endings from disfluent pauses.

Our approach was multi-disciplinary, combining linguistic theory, speech knowledge, and statistical modeling techniques. The research involved: 1) determining a representation of prosodic information suitable for use in speech understanding systems, 2) conducting distributional and acoustic analyses of speech corpora to better understand prosodic phenomena and define the structure of the computational models, 3) developing reliable algorithms for detection of the prosodic markers in speech, 4) investigating architectures for integrating prosodic cues in speech understanding systems, and 5) assessing potential performance improvements by evaluating prosody algorithms in an actual spoken language system (SLS). An important aspect of the approach was post-recognition prosody processing, which provides better duration cues and allows for conditioning on segmental effects. Prosody was used as a supplementary knowledge source, providing information not available from the words alone for evaluating language interpretation hypotheses.

The project investigated three different aspects of prosody: the marking of prominent syllables and phrase boundaries and the relationship of these cues to syntactic structure, the association of prosodic features with disfluencies in spontaneous speech, and the use of prosody as a cue to higher level dialog structure. Specific contributions of this project include:

(August 1989 -- December 1996)

SPONSORS: National Science Foundation and ARPA, NSF IRI-8905249/IRI-9248730


The following recent publications and presentations were supported at least in part by this grant. Other publications supported by this grant are listed on the publications page.

A. Stolcke and E. Shriberg, ``Statistical language modeling for speech disfluencies,'' Proc. ICASSP, I:405-408, 1996.

P. Price and M. Ostendorf, ``Combining Linguistic with Statistical Methods in Modeling Prosody,'' in Signal to syntax: Bootstrapping from speech to grammar in early acquisition, J. L. Morgan and K. Demuth (Eds.), pp. 67-83, Hillsdale, NJ: Lawrence Erlbaum Associates, 1996.

E. Shriberg and A. Stolcke, ``Word predictability after hesitations: A corpus-based study,'' Proc. ICSLP, 1996.

E. Shriberg, D. R. Ladd, and J. Terken, ``Modeling intra-speaker pitch range variation: Predicting F0 targets when `speaking up','' Proc. ICSLP, 1996.

M. H. Siu, M. Ostendorf, and H. Gish, ``Modeling Disfluencies in Conversational Speech,'' Proc. ICSLP, 1996.

L. Dilley, S. Shattuck-Hufnagel, and M. Ostendorf, ``Glottalization of Vowel-Initial Syllables as a Function of Prosodic Structure,'' J. Phonetics, 24:423-444, 1996.

M. Ostendorf and K. Ross, ``A Multi-Level Model for Recognition of Intonation Labels,'' in Computing Prosody, Y. Sagisaka, N. Campbell and N. Higuchi (Eds.), pp. 291-308, New York: Springer-Verlag, 1997.

M. Swerts and M. Ostendorf, ``Prosodic Indications of Discourse Structure in Human-Machine Interactions,'' Speech Communication, v. 22, 1997.

M. Ostendorf, ``Linking Speech Recognition and Language Processing Through Prosody,'' CC-AI, to appear.

E. E. Shriberg, R. A. Bates, and A. Stolcke, ``A prosody-only decision-tree model for disfluency detection,'' Proc. EUROSPEECH, 1997.

M. Siu and M. Ostendorf, ``Variable N-gram Language Modeling and Extensions for Conversational Speech,'' Proc. EUROSPEECH, 1997.
