Control of prosodic patterns for speech generation in human-computer dialogs

Cameron Fordyce

Current speech synthesis systems produce intelligible output under conditions with low background noise and low cognitive load. However, the quality is far from natural, and intelligibility degrades significantly under less-than-ideal conditions. One widely agreed-upon area for improvement in speech synthesis output is prosody. Prosody includes the acoustic characteristics of speech that communicate important syntactic, semantic, and discourse information about the utterance. The acoustic correlates of prosody are the pauses, fundamental frequency contours, energy, and duration changes of utterances.

    Typically, prosody synthesis is a two-step process: symbolic prosodic labels such as phrase boundaries and relative emphasis are first predicted from annotated text, and the acoustic correlates are then predicted from these labels combined with phonetic information. The goal of this research is to improve the prediction of symbolic prosodic labels for text-to-speech systems, specifically the location of phrase boundaries and phrase-level emphasis (i.e., pitch accents). To date, the most successful algorithms for predicting symbolic prosodic labels are based on either handwritten rules or statistical methods. This research will adopt and modify an alternative algorithm, transformational rule-based learning (TRBL), which has had success in many natural language processing tasks. This learning algorithm is automatically trainable like statistical methods, but is less sensitive to sparse training data. A second contribution of the thesis is an analysis of the interaction of phrase and accent symbols in prediction. Previous approaches have predicted these prosodic events serially, but there is no agreement on the order. In this study, we compare serial prediction in the two possible orders and also explore joint prediction. To facilitate joint prediction, the TRBL learning algorithm is combined with a multi-level feature and label representation.
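
The general idea behind transformation-based learning can be illustrated with a minimal sketch. It is not the thesis's implementation: the single rule template, the toy "following token" feature, and the names `Rule`, `train_tbl`, and `predict` are all assumptions made for illustration. The learner starts from a baseline labeling and greedily adds whichever rewrite rule removes the most remaining errors.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule template: relabel src as dst when the feature matches."""
    src: str
    dst: str
    feature: str


def apply_rule(rule: Rule, features: List[str], labels: List[str]) -> List[str]:
    # Rewrite every position whose current label and feature trigger the rule.
    return [rule.dst if (lab == rule.src and feat == rule.feature) else lab
            for feat, lab in zip(features, labels)]


def count_errors(labels: List[str], gold: List[str]) -> int:
    return sum(lab != ref for lab, ref in zip(labels, gold))


def train_tbl(features: List[str], gold: List[str],
              baseline: str, max_rules: int = 10) -> List[Rule]:
    """Greedy learner: repeatedly keep the rule that most reduces training errors."""
    labels = [baseline] * len(gold)
    label_set = sorted(set(gold) | {baseline})
    feature_set = sorted(set(features))
    learned: List[Rule] = []
    for _ in range(max_rules):
        best_rule, best_err = None, count_errors(labels, gold)
        for feat in feature_set:
            for src in label_set:
                for dst in label_set:
                    if src == dst:
                        continue
                    rule = Rule(src, dst, feat)
                    err = count_errors(apply_rule(rule, features, labels), gold)
                    if err < best_err:
                        best_rule, best_err = rule, err
        if best_rule is None:  # no candidate improves the labeling: stop
            break
        labels = apply_rule(best_rule, features, labels)
        learned.append(best_rule)
    return learned


def predict(rules: List[Rule], features: List[str], baseline: str) -> List[str]:
    # Apply the learned rules in the order in which they were learned.
    labels = [baseline] * len(features)
    for rule in rules:
        labels = apply_rule(rule, features, labels)
    return labels


# Toy data (invented): the feature is the kind of token that follows each word.
train_features = ["word", "comma", "word", "word", "period",
                  "word", "comma", "word", "period"]
train_gold = ["none", "minor", "none", "none", "major",
              "none", "minor", "none", "major"]

rules = train_tbl(train_features, train_gold, baseline="none")
predicted = predict(rules, ["word", "period", "comma"], baseline="none")
```

Because the output is an ordered list of readable rules rather than a probability table, each rule needs only enough evidence to reduce the error count, which is one informal way to see why the approach can tolerate sparse training data.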

    Experimental studies were conducted on a prosodically labeled corpus of radio news speech, predicting the presence vs. absence of a pitch accent at the syllable level and three levels of phrase breaks (none, major, minor) at the word level. First, TRBL was compared to the most popular statistical method, decision trees, for the task of pitch accent location prediction with known phrase boundaries; TRBL gave a small improvement in prediction accuracy over decision trees. Second, a distance-based metric for phrase prediction design was proposed and evaluated, on the argument that this metric better describes the linguistic differences between levels of phrase breaks than exact-match accuracy does. The results showed that the new metric leads to higher prediction rates for minor phrase boundaries. Next, experiments were conducted to assess the order of serial prediction; they indicated that phrase boundary information helps accent prediction but not vice versa, and therefore that phrase structure should be predicted first. Experiments also showed that the performance loss associated with using predicted vs. actual spoken phrase boundaries in accent prediction can be almost entirely regained when using training data labeled with predicted boundaries. A final experiment compared serial prediction with joint prediction of pitch accents and phrase boundaries using TRBL, finding no advantage to joint prediction.
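
The contrast between exact-match accuracy and a distance-based score can be sketched as follows. The abstract does not define the thesis's metric, so this is only one plausible form: it assumes the break levels are ordered none < minor < major and that the penalty grows linearly with the distance between predicted and reference levels. The names `BREAK_LEVEL`, `exact_match_accuracy`, and `distance_based_score` are hypothetical.

```python
from typing import List

# Assumed ordinal scale for break levels (not taken from the thesis).
BREAK_LEVEL = {"none": 0, "minor": 1, "major": 2}


def exact_match_accuracy(pred: List[str], ref: List[str]) -> float:
    """Fraction of words whose predicted break level equals the reference."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def distance_based_score(pred: List[str], ref: List[str]) -> float:
    """1.0 for a perfect match; larger level differences are penalized more."""
    max_dist = max(BREAK_LEVEL.values()) - min(BREAK_LEVEL.values())
    total = sum(abs(BREAK_LEVEL[p] - BREAK_LEVEL[r]) for p, r in zip(pred, ref))
    return 1.0 - total / (max_dist * len(ref))


reference = ["none", "minor", "major", "none"]
near_miss = ["none", "major", "major", "none"]  # minor predicted as major
far_miss = ["none", "minor", "none", "none"]    # major boundary missed entirely

near_exact = exact_match_accuracy(near_miss, reference)
far_exact = exact_match_accuracy(far_miss, reference)
near_dist = distance_based_score(near_miss, reference)
far_dist = distance_based_score(far_miss, reference)
```

Under exact match the two hypotheses score the same, since each has one wrong label. Under the distance-based score, confusing a minor with a major break costs less than missing a major break altogether, which is the kind of linguistic distinction the abstract argues an exact-match measure cannot express.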

The full thesis in postscript format. (1.11 MB)
