Control of prosodic patterns for speech generation in human-computer dialogs
Cameron Fordyce
Current speech synthesis systems produce intelligible output under conditions
with low background noise and low cognitive load. However, the quality is far
from natural and intelligibility degrades significantly under less-than-ideal
conditions. One widely agreed-upon area for improvement in speech synthesis
output is prosody. Prosody includes the acoustic characteristics of speech
that communicate important syntactic, semantic, and discourse information about
the utterance. The acoustic correlates of prosody are the pauses, fundamental
frequency contours, energy, and duration changes of utterances.
    Typically, prosody synthesis is a two-step process, where symbolic
prosodic labels such as phrase boundaries and relative emphasis are predicted
from annotated text and then the acoustic correlates are predicted from these
labels combined with phonetic information. The goal of this research is to
improve the prediction of symbolic prosodic labels for text-to-speech systems,
specifically the location of phrase boundaries and phrase-level emphasis (i.e.,
pitch accents). To date, the most successful algorithms for predicting symbolic
prosodic labels are based on either handwritten rules or statistical methods.
This research will adopt and modify an alternative algorithm: transformational
rule-based learning (TRBL), which has had success in many natural language
processing tasks. This learning algorithm is automatically trainable like
statistical methods, but is less sensitive to sparse training data conditions
than these methods. A second contribution of the thesis is an analysis of the
interaction of phrase and accent symbols in prediction. Previous approaches
have predicted these prosodic events in a serial fashion, but the order is not
agreed upon. In this study, we compare serial prediction with the two possible
orders, as well as explore joint prediction. To facilitate joint prediction,
the TRBL learning algorithm is combined with a multi-level feature and label
representation.
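
As a rough illustration of the transformation-based learning idea described above, the following is a minimal sketch and not the thesis implementation. It assumes a single hypothetical rule template ("change label A to B when feature k has value v"), a toy word-level feature named `pos`, and a baseline that labels everything `none`; the function names `learn_rules`, `apply_rule`, and `tag` are likewise invented for this example.

```python
from collections import Counter


def apply_rule(rule, feats, labels):
    """Relabel every token that has from_label and matches the trigger feature."""
    from_label, to_label, key, value = rule
    return [to_label if lab == from_label and f.get(key) == value else lab
            for f, lab in zip(feats, labels)]


def learn_rules(feats, gold, initial_label, max_rules=10, min_gain=1):
    """Greedily pick the rule with the largest net error reduction each round."""
    labels = [initial_label] * len(gold)
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at tokens that are currently wrong.
        fixes = Counter()
        for f, lab, g in zip(feats, labels, gold):
            if lab != g:
                for key, value in f.items():
                    fixes[(lab, g, key, value)] += 1
        best, best_gain = None, 0
        for rule, n_fixed in fixes.items():
            from_label, _, key, value = rule
            # Tokens that are correct now but would be broken by this rule.
            n_broken = sum(1 for f, lab, g in zip(feats, labels, gold)
                           if lab == g == from_label and f.get(key) == value)
            if n_fixed - n_broken > best_gain:
                best, best_gain = rule, n_fixed - n_broken
        if best is None or best_gain < min_gain:
            break
        rules.append(best)
        labels = apply_rule(best, feats, labels)
    return rules


def tag(feats, rules, initial_label):
    """Label new data: start from the baseline, then apply the rules in order."""
    labels = [initial_label] * len(feats)
    for rule in rules:
        labels = apply_rule(rule, feats, labels)
    return labels


# Toy data (hypothetical): one feature per token, accent vs. none as labels.
feats = [{"pos": p} for p in ["noun", "det", "verb", "prep", "noun", "det"]]
gold = ["accent", "none", "accent", "none", "accent", "none"]
rules = learn_rules(feats, gold, initial_label="none")
predicted = tag(feats, rules, initial_label="none")
```

The learned model is just an ordered list of rules, which is what makes this family of learners easy to inspect compared with a purely statistical predictor.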
    Experimental studies were conducted on a prosodically labeled corpus
of radio news speech, predicting presence vs. absence of accent at the syllable level and
three levels of phrase breaks (none, major, minor) at the word level. First,
TRBL was compared to the most popular statistical method, decision trees, for
the task of pitch accent location prediction with known phrase boundaries, where
TRBL gave a small improvement in prediction accuracy over decision trees. Second,
a distance-based metric for phrase prediction design was proposed and evaluated,
arguing that this metric better describes the linguistic differences between
different levels of phrase breaks than does exact-match accuracy. The
results showed that the new metric yields higher prediction rates for minor
phrase boundaries. Next, experiments were conducted to assess the interaction of
phrase and accent prediction, showing that phrase boundaries aid accent prediction but
not vice versa and therefore that phrase structure should be predicted first.
Experiments also showed that the performance loss associated with using predicted
vs. actual spoken phrase boundaries in accent prediction can be almost entirely
regained when using training data labeled with predicted boundaries. A final
experiment compared serial prediction with joint prediction of pitch accents
and phrase boundaries using TRBL, finding no advantage to joint prediction.
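
The abstract does not define the distance-based metric for phrase breaks, so the sketch below shows only one plausible reading and should not be taken as the thesis's formula. It assumes the three break levels are ordered none < minor < major and scores a prediction by how far it lies from the reference level, so that confusing a minor with a major break costs less than confusing none with major. The names `BREAK_LEVEL`, `exact_match_accuracy`, and `distance_score` are hypothetical.

```python
# Assumed ordinal scale for break levels (an illustration, not from the thesis).
BREAK_LEVEL = {"none": 0, "minor": 1, "major": 2}


def exact_match_accuracy(pred, gold):
    """Fraction of word boundaries whose predicted level equals the reference."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def distance_score(pred, gold):
    """1.0 for perfect output; each error is penalised by its ordinal distance."""
    max_dist = max(BREAK_LEVEL.values()) - min(BREAK_LEVEL.values())
    total = sum(abs(BREAK_LEVEL[p] - BREAK_LEVEL[g]) for p, g in zip(pred, gold))
    return 1.0 - total / (max_dist * len(gold))


gold = ["none", "minor", "major", "none"]
pred = ["none", "major", "major", "minor"]
exact = exact_match_accuracy(pred, gold)
dist = distance_score(pred, gold)
```

In this toy case both errors are off by a single level, so the distance-based score is higher than the exact-match accuracy for the same predictions.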