Speech Generation for Human-Computer Interaction

This project addresses the problem of computer speech generation for human-computer interaction using spoken language, with the goal of improving speech synthesis quality by controlling prosodic parameters based on text generation outputs. The research will investigate both utterance-level and dialog-level control of prosody, developing models and associated automatic training algorithms aimed at portability to different task domains and different generators. With the dual objectives of advancing the state of the art and providing general software tools, the effort will include linguistic inquiry and statistical modeling research as well as a software engineering component. Working with a commercially available synthesizer and building on existing prosody synthesis and recognition algorithms, the research will involve: 1) collection of read and spontaneous speech corresponding to task-specific responses; 2) improvement of automatic labeling of prosodic patterns and training of prediction modules; 3) use of syntactic, semantic and discourse annotation available from text generation systems to drive prosodic control modules and thereby improve the quality of the synthesized computer speech response; and 4) investigation of the role and effectiveness of prosody in computer response for guiding the dialog, e.g., for marking clarification subdialogs and other types of system initiative. To ensure that the goal of portability is achieved, the synthesized responses will be evaluated with multiple generators and on at least two different task domains; thus an important component of the work is the development of evaluation protocols for assessing speech generation quality and its impact on human-computer interaction. By making use of the rich linguistic information available from text generation, the research will benefit spoken language technology that currently uses synthesis in a text-to-speech generation mode.
In addition, it will provide a new capability in systems that use no spoken response generation, opening up application areas such as telephone-based computer access and potentially changing the face of multi-media interactions. Moreover, the results of the investigations of prosodic marking of dialog and information structure and lessons learned from system evaluation work will have implications for improving text generation and dialog management technology, as well as prosody and synthesis research.

NSF, May 1996 - December 2001
ARPA and ONR, September 1996 - December 1997
DARPA, February 1999 - January 2000


