The ``worst case'' attribute of Gaussian vectors for data compression/source coding, originally developed by Sakrison and Lapidoth using Shannon rate-distortion theory, is developed here using the high rate quantization theory of Bennett, Zador, and Gersho and extended to Gauss mixtures, providing an approach to robust data compression for nonGaussian sources such as images. The analysis provides several interesting side results, including a new interpretation of the minimum discrimination information (MDI) distortion measure, its application to clustering models and to constructing Gauss mixture models from training data, and a variation on the minimum description length (MDL) principle for continuous distributions. High rate quantization theory provides a mathematical connection between the distortion and the performance of a classified vector quantizer for nonGaussian data designed using Gaussian distributions. Although the primary application is compression and classification, several ideas relating maximum entropy density estimation, the MAXDET problem, and Markov mesh random fields arise in the analysis. At this time no experimental evidence exists that the approach works for image coding, the motivation for the theory, but the theory provides a hindsight explanation for why CELP speech coders work as well as they do.
Speaker adaptation is recognized as an essential part of today's large vocabulary automatic speech recognition systems. A family of techniques that has been extensively applied for limited adaptation data is transformation-based adaptation. In transformation-based adaptation we partition our parameter space into a set of classes, estimate a transform (usually linear) for each class, and apply the same transform to all the components of the class. It is known, however, that additional gains can be made if we do not constrain the components of each class to use the same transform. In this work two speaker adaptation algorithms are described. In the first half of this work, instead of estimating one linear transform for each class (as Maximum Likelihood Linear Regression (MLLR) does, for example), we estimate multiple linear transforms per class of models and a transform weights vector which is specific to each component (Gaussians in our case). This in effect means that each component receives its own transform without having to estimate each one of them independently. This scheme, termed Maximum Likelihood Stochastic Transformations (MLST), achieves a good trade-off between robustness and acoustic resolution, and it proved superior to MLLR for 10 adaptation sentences or more. The algorithm is evaluated on the Wall Street Journal (WSJ) corpus for non-native speakers, and it is shown that in the case of 40 adaptation sentences the algorithm outperforms MLLR by more than 13%. In the second half of this work, we introduce a variant of MLST designed to operate under sparsity of data. Since the majority of the adaptation parameters are the transformations, we estimate them on the training speakers and adapt to a new speaker by estimating the transform weights only. First we cluster the speakers into a number of sets and estimate the transformations on each cluster. The new speaker then uses transformations from all clusters to perform adaptation.
This method, termed Basis Transformations, can be seen as a speaker similarity scheme. Experimental results on the WSJ corpus show that when Basis Transformations is cascaded with MLLR, marginal gains can be obtained over MLLR alone for adaptation of native speakers.
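To make the idea of per-component transform weights concrete, here is a minimal sketch of one possible reading of the description above: every Gaussian mean in a class is adapted by a weighted combination of the affine transforms shared by that class. The names apply_affine and adapt_mean, the toy transforms, and the weights are all hypothetical, and the maximum likelihood estimation of the transforms and weights is not shown.

```python
def apply_affine(A, b, mu):
    """Return A @ mu + b for a list-of-lists matrix A and list vectors b, mu."""
    return [sum(a * m for a, m in zip(row, mu)) + b_i for row, b_i in zip(A, b)]


def adapt_mean(mu, transforms, weights):
    """Adapt one Gaussian mean with a weighted sum of shared affine transforms.

    transforms: list of (A, b) pairs shared by every component of the class.
    weights:    one weight per transform, specific to this component
                (assumed here to be non-negative and to sum to one).
    """
    adapted = [0.0] * len(mu)
    for (A, b), w in zip(transforms, weights):
        transformed = apply_affine(A, b, mu)
        adapted = [acc + w * t for acc, t in zip(adapted, transformed)]
    return adapted


# Two toy transforms shared by the whole class (illustrative values only).
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # identity
    ([[2.0, 0.0], [0.0, 2.0]], [1.0, -1.0]),  # scale by 2, then shift
]

# Two components with the same mean but different weight vectors
# end up with different effective transforms.
mean_a = adapt_mean([1.0, 2.0], transforms, [1.0, 0.0])  # -> [1.0, 2.0]
mean_b = adapt_mean([1.0, 2.0], transforms, [0.5, 0.5])  # -> [2.0, 2.5]
```

Under this reading, adapting to a new speaker with the transforms held fixed only requires choosing the small weight vectors, which is the part the Basis Transformations variant estimates from the adaptation data.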
The constant frame length in typical ASR front ends is too long to capture transient phenomena in speech, such as stop bursts. However, current HMM systems have consistently outperformed systems based solely on non-uniform units. This work investigates an approach to ``add back'' such transient information to a speech recognizer, without losing the robustness of the standard acoustic models. We demonstrate a set of phonetically-motivated acoustic features that discriminate a preliminary test set of highly ambiguous voiceless stops in CV contexts. The features are automatically computed from data that had been hand-marked for consonant burst location and voicing onset (extension to automatic marking is also proposed). Two corpora are processed using a parallel set of features: conversational speech over the telephone (Switchboard), and a corpus of carefully elicited speech. The latter provides an upper bound on discrimination, and allows for comparison of feature usage across speaking styles. We explore data-driven approaches to obtaining variable-length time-localized features compatible with an HMM statistical framework. We also suggest techniques for extension to automatic annotation of burst location, for computation of features at such points, and for augmentation of an HMM system with the added information. (This is joint work with Madelaine Plauche, Elizabeth Shriberg and Horacio Franco.)
We describe a technique for analyzing the output of a speech recognizer for the purpose of detecting misrecognition of spoken input. Our technique combines a wide range of diverse knowledge sources, including part-of-speech tags, word confidence scores, name lists, and bigram language models, into a unified probabilistic framework. This framework can be used to jointly identify word errors and semantic phrases, such as names, in speech recognition output. We describe extensions to the framework for identifying different types of errors, such as those caused by out-of-vocabulary input words.
Hidden Markov Models (HMMs) have been successful for modelling the dynamics of carefully dictated speech, but their performance degrades severely when used to model conversational speech. This talk will present a preliminary feasibility study of an alternative class of models: loosely coupled, or factorial, HMMs. Since speech is produced by a system of loosely coupled articulators, stochastic models explicitly representing this parallelism may have advantages for automatic speech recognition (ASR), particularly when trying to model the phonological effects inherent in casual spontaneous speech. The talk will present results for one specific coupled model on a simple ASR task, using both exact and approximate estimation schemes, and will conclude that this class of models merits further investigation.
Understanding how people speak when they interact with spoken dialogue systems is critical to improving the performance of those systems. In particular, speakers' prosodic behavior provides useful indicators of a) whether a speaker turn will be recognized correctly or not by an automatic speech recognition (ASR) system; b) whether a speaker is reacting to a system error; and c) whether a speaker is correcting such an error. From a practical perspective, previous research has found that user attempts to correct system errors are themselves more likely to be *mis*recognized than other utterances, and thus may require special handling. Knowing whether speakers are more likely to repeat or rephrase their utterances, add new information or shorten their input, and how system behavior influences these choices can suggest appropriate on-line modifications to a dialogue system's interaction strategy or to the recognition procedures it employs. This talk will present results of analyses of lexical and prosodic characteristics of human interactions with the TOOT spoken dialogue system, an experimental system for accessing train schedules over the web that combines ASR, text-to-speech, and a telephone interface. It will suggest, in particular, how prosodic information may prove important in controlling human-machine interactions. (This is joint work with Julia Hirschberg and Diane Litman.)
Clusters have been one of the staples of language modeling research for almost as long as there has been language modeling research. I will give a novel clustering approach that allows us to create smaller models, and to train maximum entropy models faster. First, I examine how to use clusters for language model compression, with a surprising result. I achieve my best results by first making the models larger using clustering, and then pruning them. This can result in a factor of three or more reduction in model size at the same perplexity. I then go on to examine a novel way of using clustering to speed up maximum entropy training. Maximum entropy is considered by many people to be one of the more promising avenues of language model research, but it is prohibitively expensive to train large models. I show how to use clustering to speed up training time by up to a factor of 35 over standard techniques, while slightly improving perplexity. The same approach can be used to speed up some other learning algorithms that try to predict a very large number of outputs.
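The abstract does not say how the clusters are used to speed up training. One standard way clusters reduce the cost of predicting a very large number of outputs is to factor the prediction into two steps, first the cluster and then the word within that cluster, so that each normalization runs over a short list instead of the whole vocabulary. The sketch below illustrates that factorization only; it is an assumption about the general idea, not the speaker's implementation, and the names softmax and word_probability, the toy clusters, and the scores are hypothetical.

```python
import math


def softmax(scores):
    """Normalize a dict of real-valued scores into probabilities."""
    top = max(scores.values())
    exps = {key: math.exp(value - top) for key, value in scores.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


def word_probability(word, cluster_of, cluster_scores, word_scores_by_cluster):
    """P(word | history) = P(cluster | history) * P(word | history, cluster).

    Each normalization touches only the clusters, or only the words of one
    cluster, never the full vocabulary at once.
    """
    cluster = cluster_of[word]
    p_cluster = softmax(cluster_scores)[cluster]
    p_word_given_cluster = softmax(word_scores_by_cluster[cluster])[word]
    return p_cluster * p_word_given_cluster


# Toy scores for one fixed history (illustrative values only).
word_scores_by_cluster = {
    "days": {"monday": 1.0, "tuesday": 0.5},
    "verbs": {"run": 0.2, "walk": -0.3, "jump": 0.0},
}
cluster_scores = {"days": 0.3, "verbs": -0.1}
cluster_of = {
    word: cluster
    for cluster, words in word_scores_by_cluster.items()
    for word in words
}

# The factored probabilities still form a distribution over all words.
total_probability = sum(
    word_probability(word, cluster_of, cluster_scores, word_scores_by_cluster)
    for word in cluster_of
)
```

With V words split into roughly sqrt(V) clusters of roughly sqrt(V) words each, both normalizations are over about sqrt(V) items, which is where a saving of this kind would come from.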
Markov chain Monte Carlo (MCMC) refers to a particular type of numerical algorithm for evaluating integrals of functions with respect to awkward, usually very high-dimensional, probability distributions. The basic idea is to design a Markov chain whose limiting distribution is the distribution of interest and then to use simulation of the chain to estimate the required integrals via corresponding sample averages. This talk will introduce the topic and outline some of the advances over the last 50 years in analyzing extremely complex stochastic systems. Some milestones include Metropolis' method (1953), Hastings' algorithm (1970), the Gibbs sampler (1976, 1984), simulated annealing (1983), MCMC maximum likelihood (1984, 1991), Swendsen-Wang (1987), multigrid MCMC (1988), MCMC p-values (1989), simulated tempering (1992), reversible jumps (1995), coupling from the past (1996) and other more recent methods of perfect MCMC. In Bayesian inference, MCMC has become a standard computational engine over the past decade.
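As a concrete instance of the basic idea, the following minimal sketch runs a random-walk Metropolis chain whose limiting distribution is a one-dimensional target known only up to a constant, and estimates integrals by sample averages. The function names, the step size, and the standard normal target are illustrative choices, not taken from the talk.

```python
import math
import random


def metropolis(log_target, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density."""
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        # On rejection the chain stays where it is, and that state is recorded again.
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    """Log of the N(0, 1) density, dropping the normalizing constant."""
    return -0.5 * x * x


samples = metropolis(standard_normal_log_density, x0=0.0, n_steps=20000)

# Sample averages along the chain estimate expectations under the target.
mean_estimate = sum(samples) / len(samples)
second_moment_estimate = sum(s * s for s in samples) / len(samples)
```

For this target the two averages should be close to 0 and 1 respectively. The acceptance rule needs only the ratio of target densities, which is why the normalizing constant never has to be computed.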
As efforts to create voice-enabled interfaces and applications proliferate rapidly, the problem of speech enhancement in noisy, reverberant real-world environments is acquiring crucial importance. I will describe a new technique for the removal of environmental distortions from speech signals. This technique is based on a unified probabilistic framework, which transforms the enhancement problem into Bayes-optimal signal estimation. Key points in our approach are the use of a strong speech model, and the leveraging of variational techniques and conjugate priors to derive efficient algorithms. Results obtained using this technique are substantially better than those obtained with standard methods. (This is joint work with John Platt and Alex Acero.)