Recent interest in the automatic processing of meetings is motivated by a desire to summarize, browse, and retrieve important information from lengthy archives of spoken data. One of the most useful capabilities such a technology could provide is a way for users to locate ``hot spots'', or regions in which participants are highly involved in the discussion (e.g. heated arguments or points of excitement). We ask two questions about hot spots in meetings in the ICSI Meeting Recorder corpus. First, we ask whether involvement can be judged reliably by human listeners. Results show that despite the subjective nature of the task, raters show significant agreement in distinguishing involved from non-involved utterances. Second, we ask whether there is a relationship between human judgments of involvement and automatically extracted prosodic features of the associated regions. Results show that there are significant differences in both F0 and energy between involved and non-involved utterances. These findings suggest that humans do agree to some extent on the judgment of hot spots, and that acoustic-only cues could be used for automatic detection of hot spots in natural meetings.
--
TBA
--
Mutual information has been useful in many areas, from Bayesian network structure learning to spike train analysis. I will talk about the Mutual Information Toolkit, a set of tools to compute information theoretic quantities on very large data sets. I will briefly go over some issues involved in the estimation of the mutual information and talk about some of its applications.
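As a rough illustration of the quantity in question (not the toolkit's own code or interface), the sketch below computes the simple plug-in estimate of mutual information from paired discrete samples. The function name `mutual_information` and its list-based inputs are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in (maximum-likelihood) estimate of I(X;Y) in bits.

    xs and ys are equal-length sequences of hashable discrete symbols.
    Hypothetical helper for illustration; not part of any toolkit.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) with p = count / n
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

One estimation issue is visible even in this small form: with few samples relative to the number of distinct symbol pairs, the plug-in estimate tends to overestimate the true mutual information, which matters when working with sparse counts.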
--
The most prevalent approach to language modeling is n-grams, i.e. counting occurrences of n consecutive words. Although such an approach is intuitive and has proven to be successful in practice, it suffers from the central problem of an exponential increase in the number of parameters as n increases. In this work, a continuous, real-valued vector representation is associated with each word or group of words, which allows us to apply parametric estimation techniques, such as mixtures of Gaussians, to estimate the required distributions. Associating a continuous vector representation with a word has a number of advantages. First, continuous models may generalize better to unseen events than discrete ones. Second, with parametric models we can tune the number of parameters at a much finer level than with discrete language models and therefore possibly avoid overfitting. Third, parametric language models can be much smaller in size. Representing words with continuous vectors is not an entirely new concept in the language modeling community, but the adaptation abilities of such models have been overlooked. The continuous models can be adapted to new domains/styles using a linear transformation rather than interpolation techniques, which are the standard procedure for language model adaptation. In the task of acoustic model adaptation, using a linear transformation has proven to be superior to interpolation techniques when adaptation data are limited. In this ongoing work I explore the adaptation abilities of such models when Gaussian mixture models (GMMs) are used to model the feature space.
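To make the parameter-growth argument concrete, the sketch below counts free parameters for an unsmoothed n-gram table and for a single diagonal-covariance Gaussian mixture. The vocabulary size, mixture size, and vector dimension are illustrative assumptions, not figures from the work described, and the mixture count covers only the mixture itself, not the word vectors or any other part of a full model.

```python
def ngram_free_parameters(vocab_size, n):
    """Free parameters of a full, unsmoothed n-gram table.

    Each of the vocab_size**(n-1) histories has a distribution over
    vocab_size words, which leaves vocab_size - 1 free probabilities.
    """
    return vocab_size ** (n - 1) * (vocab_size - 1)


def diagonal_gmm_free_parameters(num_components, dim):
    """Free parameters of one diagonal-covariance Gaussian mixture.

    Each component has dim means and dim variances; the mixture
    weights sum to one, leaving num_components - 1 free weights.
    """
    return num_components * 2 * dim + (num_components - 1)


if __name__ == "__main__":
    # Assumed, illustrative sizes only.
    vocab_size = 10_000
    for n in (1, 2, 3):
        print(n, ngram_free_parameters(vocab_size, n))
    print("GMM", diagonal_gmm_free_parameters(num_components=1000, dim=50))
```

The n-gram count multiplies by the vocabulary size with every increment of n, whereas the mixture count grows linearly in the number of components and in the dimension, which is what allows the finer-grained control over model size mentioned above.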
--
To support summarization of automatically transcribed meetings, we introduce a classifier to recognize agreement or disagreement utterances, utilizing both word-based and prosodic cues. We show that hand-labeling efforts can be minimized by using unsupervised training on a large unlabeled data set combined with supervised training on a small amount of data. For ASR transcripts with over 45\% WER, the system recovers nearly 80\% of agree/disagree utterances with a confusion rate of only 3\%.
--
The majority of current work in automatic speech recognition (ASR) employs data-driven pattern recognition methods. A major constraint on the efficacy of these systems is the amount of data available to train the models. However, acquiring new data is often limited by the amount of available resources. Additionally, previous results have shown that not all data is equally suitable for training ASR systems. Therefore, being able to choose which data is more important for system performance would allow researchers to allocate resources for transcribing new data more effectively. In this work, we explore methods based on likelihood measures and ideas from child language acquisition to identify which data would be most useful for training ASR models. We then apply selective sampling to ASR model training using different selection criteria, evaluating the resulting systems for conversational speech recognition. A small gain is obtained by using prosody measures in selection.
--
In this talk I will discuss some of the work I did while at LIMSI. Most of my work was based on the paper ``Error Corrective Mechanisms For Speech Recognition'' (Mangu and Padmanabhan, ICASSP 2001). I will describe the approach presented in this paper and my efforts to apply it to the LIMSI Hub-5 (Switchboard) recognizer. This approach uses Transformation-Based Learning to automatically learn rules that correct common mistakes made by a speech recognizer.
--
Speech synthesis has changed dramatically in the past few years to have a corpus-based focus, borrowing heavily from advances in automatic speech recognition. In this talk, we survey technology in speech recognition systems and how it translates (or doesn't translate) to speech synthesis systems. We further speculate on future areas where ASR may impact synthesis and vice versa.
--