Flexible Speech Synthesis Using Weighted Finite State Transducers


Ivan Bulyko


Abstract

The main focus of this thesis is on improving the quality of concatenative speech synthesis by taking advantage of the natural (allowable) variability in spoken language: there are multiple ways of uttering a given sentence, and several word sequences can represent a given concept. An architecture for speech generation in constrained-domain applications is proposed that tightly integrates language generation and speech synthesis, allowing the choice of words and desired intonation in the system's response to be optimized jointly with the speech output quality. Experiments with a travel-planning dialog system demonstrate that expanding the space of candidate responses and possible prosodic realizations yields higher-quality speech output.

The additional flexibility in terms of word sequences, prosodic realizations, and pronunciations increases the search space and, consequently, the computational cost of the synthesis system. To address this problem, this thesis also offers improvements to the popular unit selection approach that constrain or prune the search space more accurately at the acoustic level. In particular, we describe a variation on the cluster-based unit database design aimed at constraining the set of candidate units, and we introduce splicing costs into the unit search criterion to indicate which unit boundaries are particularly good or poor join points, augmenting existing concatenation measures for better pruning of the search space. As a byproduct, the new splicing costs also lead to improvements in speech quality.
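To make the role of splicing costs in the unit search concrete, the following is a minimal sketch of a dynamic-programming unit search, not the implementation used in the thesis. The names `Unit`, `join_cost`, and `select_units`, the additive combination of costs, the default concatenation cost, and all numeric values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A hypothetical candidate unit from the database."""
    name: str
    target_cost: float   # mismatch between the unit and the desired target
    splice_left: float   # how poor the unit's left boundary is as a join point
    splice_right: float  # how poor the unit's right boundary is as a join point


def join_cost(prev, cur, concat):
    """Concatenation measure augmented with splicing costs (assumed additive)."""
    base = concat.get((prev.name, cur.name), 1.0)  # assumed default for unseen pairs
    return base + prev.splice_right + cur.splice_left


def select_units(candidates, concat):
    """Viterbi-style search over one candidate list per target position.

    Returns (total_cost, best_sequence_of_units).
    """
    # history[pos][j] = (best cost ending in candidate j, backpointer)
    history = [[(u.target_cost, None) for u in candidates[0]]]
    for pos in range(1, len(candidates)):
        layer = []
        for cur in candidates[pos]:
            cost, back = min(
                (history[-1][k][0] + join_cost(prev, cur, concat), k)
                for k, prev in enumerate(candidates[pos - 1])
            )
            layer.append((cost + cur.target_cost, back))
        history.append(layer)

    # Trace back from the cheapest final candidate.
    j = min(range(len(history[-1])), key=lambda k: history[-1][k][0])
    total = history[-1][j][0]
    path = []
    for pos in range(len(candidates) - 1, -1, -1):
        path.append(candidates[pos][j])
        j = history[pos][j][1]
    path.reverse()
    return total, path


# Toy data: "a2" joins to "b1" more smoothly by the concatenation measure,
# but its right boundary is a poor splice point.
candidates = [
    [Unit("a1", 1.0, 0.0, 0.0), Unit("a2", 1.0, 0.0, 4.0)],
    [Unit("b1", 1.0, 0.0, 0.0)],
]
concat = {("a1", "b1"): 2.0, ("a2", "b1"): 1.0}
total, path = select_units(candidates, concat)
print(total, [u.name for u in path])
```

In this toy example the concatenation measure alone would favor `a2`, but the splicing cost at its right boundary steers the search to `a1`. A pruning variant could discard candidates whose boundary costs exceed a threshold before the search begins.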

Finally, we introduce a modular speech synthesis system architecture in which each component is represented with weighted finite-state transducers (WFSTs), and we describe specific WFST implementations of the prosody prediction and unit selection modules. Such an architecture provides an efficient representation of flexible targets and allows the steps in the synthesis process to be performed with operations available in a general-purpose toolbox.
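As an illustration of how synthesis steps can reduce to generic WFST operations, here is a minimal pure-Python sketch of epsilon-free composition followed by a shortest-path search in the tropical semiring (weights add along a path, and the minimum is taken across paths). It does not use any real toolkit and is not the thesis's implementation. The `Wfst` class, the two toy machines, and their weights are hypothetical: `targets` maps phone symbols to candidate unit labels with assumed target costs, and `joins` accepts unit sequences with assumed join costs.

```python
import heapq
from collections import namedtuple

Arc = namedtuple("Arc", "src ilabel olabel weight dst")


class Wfst:
    """A tiny epsilon-free WFST without final weights (illustrative only)."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = set(finals)
        self.arcs = {}
        for a in arcs:
            self.arcs.setdefault(a.src, []).append(a)


def compose(a, b):
    """Match a's output labels with b's input labels; weights add."""
    start = (a.start, b.start)
    arcs, finals, seen, stack = [], set(), {start}, [start]
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a.finals and qb in b.finals:
            finals.add(state)
        for x in a.arcs.get(qa, []):
            for y in b.arcs.get(qb, []):
                if x.olabel == y.ilabel:
                    dst = (x.dst, y.dst)
                    arcs.append(Arc(state, x.ilabel, y.olabel, x.weight + y.weight, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return Wfst(start, finals, arcs)


def shortest_path(t):
    """Dijkstra search (non-negative weights); returns (cost, output labels)."""
    heap = [(0.0, 0, t.start, ())]
    done, counter = set(), 1  # counter breaks ties without comparing states
    while heap:
        cost, _, state, out = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in t.finals:
            return cost, list(out)
        for arc in t.arcs.get(state, []):
            heapq.heappush(heap, (cost + arc.weight, counter, arc.dst, out + (arc.olabel,)))
            counter += 1
    return None


# Flexible targets: each phone may be realized by several candidate units.
targets = Wfst(0, [2], [
    Arc(0, "a", "a1", 1.0, 1), Arc(0, "a", "a2", 0.5, 1),
    Arc(1, "b", "b1", 1.0, 2), Arc(1, "b", "b2", 0.5, 2),
])
# Join costs between consecutive units, written as an acceptor over unit labels.
joins = Wfst("s", ["E"], [
    Arc("s", "a1", "a1", 0.0, "A1"), Arc("s", "a2", "a2", 0.0, "A2"),
    Arc("A1", "b1", "b1", 0.0, "E"), Arc("A1", "b2", "b2", 2.0, "E"),
    Arc("A2", "b1", "b1", 2.0, "E"), Arc("A2", "b2", "b2", 2.0, "E"),
])
cost, units = shortest_path(compose(targets, joins))
print(cost, units)
```

With this arrangement, the unit search is expressed entirely as composition plus a best-path query, so swapping in a different target or join model only changes the machines, not the search code.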


