The additional flexibility in word sequences, prosodic realizations, and pronunciations enlarges the search space and, consequently, raises the computational cost of the synthesis system. To address this problem, this thesis also offers improvements to the popular unit selection approach that constrain or prune the search space more accurately at the acoustic level. In particular, we describe a variation of the cluster-based unit database design aimed at constraining the set of candidate units, and we introduce splicing costs into the unit search criterion. These costs indicate which unit boundaries are particularly good or poor join points, augmenting existing concatenation measures for better pruning of the search space. As a byproduct, the new splicing costs also lead to improvements in speech quality.
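To make the role of splicing costs in the search criterion concrete, the following is a minimal sketch of a Viterbi-style unit search with beam pruning. It is an illustration under stated assumptions, not the thesis's implementation: the `Unit` fields, the additive combination of concatenation and splicing costs, the zero join cost for units that were adjacent in the recording, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    # All fields are hypothetical names chosen for this sketch.
    name: str
    target_cost: float          # mismatch between the unit and the requested target
    splice_left: float = 0.0    # penalty for joining at the unit's left boundary
    splice_right: float = 0.0   # penalty for joining at the unit's right boundary
    next_in_db: str = ""        # unit that followed this one in the recording


def join_cost(prev, cur, concat_cost):
    """Concatenation cost augmented with splicing costs (assumed additive)."""
    if prev.next_in_db == cur.name:
        return 0.0  # assumption: originally adjacent units need no splice
    return concat_cost(prev, cur) + prev.splice_right + cur.splice_left


def select_units(candidates, concat_cost, beam=float("inf")):
    """Return (total cost, unit sequence) for the cheapest surviving path.

    candidates: one list of Unit objects per target position.
    beam: hypotheses costlier than the current best by more than this
          amount are pruned before the next position is expanded.
    """
    hyps = [(u.target_cost, [u]) for u in candidates[0]]
    for position in candidates[1:]:
        floor = min(cost for cost, _ in hyps)
        hyps = [(c, p) for c, p in hyps if c - floor <= beam]  # beam pruning
        new_hyps = []
        for u in position:
            # Best predecessor for this candidate unit.
            cost, path = min(
                ((c + join_cost(p[-1], u, concat_cost), p) for c, p in hyps),
                key=lambda cp: cp[0],
            )
            new_hyps.append((cost + u.target_cost, path + [u]))
        hyps = new_hyps
    return min(hyps, key=lambda cp: cp[0])


# Toy example with made-up costs.
candidates = [
    [Unit("a1", 1.0, splice_right=0.0, next_in_db="b1"),
     Unit("a2", 0.5, splice_right=2.0)],
    [Unit("b1", 1.0, splice_left=0.0),
     Unit("b2", 0.2, splice_left=0.1)],
]


def flat_concat(prev, cur):
    return 1.0  # placeholder for a spectral concatenation measure


best_cost, best_path = select_units(candidates, flat_concat)
pruned_cost, pruned_path = select_units(candidates, flat_concat, beam=0.1)
```

In this toy setup the unpruned search prefers the contiguous pair `a1`, `b1`, whereas a tight beam discards `a1` early and settles on a costlier path. The example shows why the quality of the pruning criterion matters: a boundary-aware cost gives the search more information about which hypotheses are safe to drop.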
Finally, we introduce a modular speech synthesis system architecture in which each component is represented as a weighted finite-state transducer (WFST), and we describe specific WFST implementations of the prosody prediction and unit selection modules. Such an architecture provides an efficient representation of flexible targets and allows the steps in the synthesis process to be performed with operations available in a general-purpose toolbox.
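As a rough illustration of how synthesis steps can reduce to generic transducer operations, the sketch below composes a small "flexible target" transducer with a unit-inventory transducer and extracts the cheapest path. It is a self-contained toy, not the toolbox used in the thesis: the `Wfst` tuple, the epsilon-free composition, the tropical-semiring weights (costs add along a path), and all labels and costs are assumptions made for the example.

```python
import heapq
from collections import namedtuple

# arcs: {state: [(input_label, output_label, weight, next_state), ...]}
Wfst = namedtuple("Wfst", "start finals arcs")


def compose(a, b):
    """Epsilon-free composition; weights add (tropical semiring)."""
    start = (a.start, b.start)
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        qa, qb = state
        if qa in a.finals and qb in b.finals:
            finals.add(state)
        out = arcs.setdefault(state, [])
        for i, mid, w1, na in a.arcs.get(qa, []):
            for mid2, o, w2, nb in b.arcs.get(qb, []):
                if mid == mid2:  # output of a must match input of b
                    nxt = (na, nb)
                    out.append((i, o, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return Wfst(start, finals, arcs)


def shortest_path(t):
    """Dijkstra search; assumes non-negative weights.

    Returns (cost, output labels) of the cheapest accepting path, or None.
    """
    heap = [(0.0, 0, t.start, ())]
    counter = 1  # tie-breaker so states are never compared directly
    done = set()
    while heap:
        cost, _, state, outs = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in t.finals:
            return cost, list(outs)
        for _, o, w, nxt in t.arcs.get(state, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + w, counter, nxt, outs + (o,)))
                counter += 1
    return None


# Flexible target: "a" followed by either "b" or an alternative "p" (cost 0.3).
target = Wfst(0, {2}, {
    0: [("a", "a", 0.0, 1)],
    1: [("b", "b", 0.0, 2), ("p", "p", 0.3, 2)],
})

# Unit inventory: maps each phone to candidate unit IDs with made-up costs.
units = Wfst(0, {0}, {
    0: [("a", "a_17", 1.0, 0), ("a", "a_42", 0.5, 0),
        ("b", "b_03", 1.0, 0), ("p", "p_08", 0.2, 0)],
})

cost, unit_ids = shortest_path(compose(target, units))
```

Here the alternative pronunciation wins because its extra target cost is outweighed by a cheaper unit, which is the kind of joint decision a flexible target representation is meant to allow. A full system would also need epsilon handling and concatenation costs between units, both omitted here for brevity.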
The full thesis in Portable Document Format.
The full thesis in compressed Postscript.
See Ivan Bulyko's home page for other related publications.