Speech Generation for Human-Computer Interaction
This project addresses the problem of computer speech generation for
human-computer interaction using spoken language, with the goal of
improving speech synthesis quality by controlling prosodic parameters
based on text generation outputs. The research will investigate both
utterance-level and dialog-level control of prosody, developing models
and associated automatic training algorithms aimed at portability to
different task domains and different generators. With the dual
objectives of advancing the state of the art and providing general
software tools, the effort will include linguistic inquiry and
statistical modeling research as well as a software engineering
component. Working with a commercially available synthesizer and
building on existing prosody synthesis and recognition algorithms, the
research will involve: 1) collection of read and spontaneous speech
corresponding to task-specific responses; 2) improvement of automatic
labeling of prosodic patterns and training of prediction modules;
3) use of syntactic, semantic and discourse annotation available from
text generation systems to drive prosodic control modules and thereby
improve the quality of the synthesized computer speech response; and
4) investigation of the role and effectiveness of prosody in computer
responses for guiding the dialog, e.g., for marking clarification
subdialogs and other types of system initiative. To ensure that the
goal of portability is achieved, the synthesized responses will be
evaluated with multiple generators and on at least two different task
domains; thus an important component of the work is development of
evaluation protocols for assessing speech generation quality and the
impact on human-computer interaction. By making use of the rich
linguistic information available from text generation, the research
will benefit spoken language technology that currently uses synthesis
in a text-to-speech generation mode. In addition, it will provide a
new capability in systems that use no spoken response generation,
opening up application areas such as telephone-based computer access
and potentially changing the face of multi-media
interactions. Moreover, the results of the investigations of prosodic
marking of dialog and information structure and lessons learned from
system evaluation work will have implications for improving text
generation and dialog management technology, as well as prosody and
NSF, May 1996 - December
ARPA and ONR, September 1996 - December 1997
February 1999 - January 2000
- Senior Staff:
Prof. Mari Ostendorf, Principal Investigator
- Graduate students:
Ivan Bulyko, PhD 2002 (UW)
Prosody generation and unit selection for concatenative speech synthesis
Cameron Fordyce, MS 1998 (BU)
Prediction of prosody markers using error-driven learning
Steve Juranich, MS candidate (UW)
Computational models of intonation
Also supported in part: Tianshu Zhou (BU), Pete Gilchrist (UW)
- Undergraduate students:
P. Sean Wheeler, BS 1998
Masaki Horii, BS candidate
I. Bulyko and M. Ostendorf, "Unit Selection for Speech Synthesis Using
Splicing Costs with Weighted Finite State Transducers", In Proc. of
Eurospeech, vol. 2, pp. 987-990, 2001.
I. Bulyko and M. Ostendorf, "Joint Prosody Prediction and Unit Selection
for Concatenative Speech Synthesis", In Proc. of ICASSP, 2001.
I. Bulyko and M. Ostendorf, "Predicting Gradient F0 Variation: Pitch
Range and Accent Prominence", In Proc. of Eurospeech, vol. 4,
pp. 1819-1822, 1999.
I. Bulyko, M. Ostendorf and P. Price, "On the Relative Importance of
Different Prosodic Factors in Improving Speech Synthesis", In Proc. of
ICPhS, vol. 1, pp. 81-84, 1999.
C. Fordyce and M. Ostendorf, "Prosody Prediction for Speech Synthesis
using Transformational Rule-based Learning", In Proc. of the
International Conference on Spoken Language Processing, vol. 3,
pp. 843-846, 1998.
R. Sproat, A. Hunt, M. Ostendorf, P. Taylor, A. Black, K. Lenzo and
M. Edgington, "SABLE: A Standard for TTS Markup", In Proc. of the
International Conference on Spoken Language Processing, vol. 5,
pp. 1719-1722, 1998.
C. Fordyce, "Control of Prosodic Patterns for Speech Generation in
Human-Computer Dialogs", M.S. Thesis, Boston University, 1998.