University of Washington
Department of Electrical Engineering
Speech Processing Seminar

Spring Quarter, 2001
1:00-2:00PM Wednesdays
RM EE1 026 New EE Bldg
University of Washington, Seattle
(unless otherwise noted)

26 April, 2001, Thursday
(Special Seminar, 1:30-2:30pm, RM 003 EE/CSE Bldg)
Speaking of the Future - Towards Free Form Dialog with Machines
-- David Nahamoo, Human Language Technologies, IBM Research

Abstract
In the next decade, we will experience a major change in the way we interact with machines. Through the convergence of the web and wireless telephony, we will have access to services and applications anywhere, anytime, and on any device. We will accomplish our tasks through free form conversational dialog with services and applications using multimodal interaction. In this talk, we will discuss the current state of conversational technologies including speech recognition, text-to-speech, dialog, natural language understanding, speech biometrics, and multimodal interaction through technology demonstrations and examine what can be expected from the technology in the next few years.

2 May, 2001, Wednesday
(1:00-2:00pm, RM 026 EE/CSE Bldg)
The Information Geometry of EM Variants for Speech and Image Processing
-- Asela Gunawardana, Johns Hopkins University

Abstract
The Expectation Maximization (EM) algorithm is an iterative technique used in many applications such as speech recognition and medical imaging, where it is employed to obtain statistical estimates from incomplete observations of the variables of interest. We analyze the EM algorithm using the information geometry of Csiszar and Tusnady, in order to understand how it may be extended. In this framework, the EM algorithm is viewed as the alternating minimization of the Kullback-Leibler information divergence between a family of statistical models and a desired family of distributions defined by the observed data. Thus, an iteration of the EM algorithm consists of a forward projection from the model family onto the desired family, followed by a backward projection from the desired family to the model family. The well-known GEM variant of the EM algorithm corresponds to replacing the backward projection with a step that reduces the divergence rather than minimizing it. Our contribution lies in showing that the convergence properties of the EM algorithm are retained when the forward projection is similarly extended, and that the desired family and the divergence can each be extended to yield useful estimation schemes. Such extensions of the EM algorithm yield a proof of convergence for the incremental EM algorithm of Neal and Hinton, and novel EM variants for estimation from small amounts of data and for estimation in the presence of outliers. The application of these EM variants to the problems of hidden Markov model estimation for automatic speech recognition, as well as to positron emission tomography, will be discussed.
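
As an illustration of the alternating structure described in this abstract, here is a minimal Python sketch of standard EM for a two-component, unit-variance Gaussian mixture. It is not taken from the talk: the model, the function names (component_scores, em_step, run_em, make_data), the synthetic data, and the starting values are all assumptions chosen for illustration. In the usual reading of the information-geometric view, the E-step plays the role of the forward projection and the M-step the role of the backward projection.

```python
import math
import random


def component_scores(x, weight, mu0, mu1):
    """Weighted, unnormalized unit-variance Gaussian scores of x under each component."""
    p0 = (1.0 - weight) * math.exp(-0.5 * (x - mu0) ** 2)
    p1 = weight * math.exp(-0.5 * (x - mu1) ** 2)
    return p0, p1


def log_likelihood(data, weight, mu0, mu1):
    """Log-likelihood of the observed (incomplete) data under the mixture."""
    norm = math.sqrt(2.0 * math.pi)
    return sum(
        math.log(sum(component_scores(x, weight, mu0, mu1)) / norm) for x in data
    )


def em_step(data, weight, mu0, mu1):
    """One EM iteration: an E-step followed by an M-step."""
    # E-step: posterior probability that each point came from component 1.
    resp = []
    for x in data:
        p0, p1 = component_scores(x, weight, mu0, mu1)
        resp.append(p1 / (p0 + p1))
    # M-step: re-estimate the mixture weight and both means from the posteriors.
    n1 = sum(resp)
    n0 = len(data) - n1
    new_mu0 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    return n1 / len(data), new_mu0, new_mu1


def run_em(data, weight=0.5, mu0=-1.0, mu1=1.0, iterations=30):
    """Run EM from hypothetical starting values, recording the log-likelihood."""
    history = [log_likelihood(data, weight, mu0, mu1)]
    for _ in range(iterations):
        weight, mu0, mu1 = em_step(data, weight, mu0, mu1)
        history.append(log_likelihood(data, weight, mu0, mu1))
    return (weight, mu0, mu1), history


def make_data(seed=0, n=200):
    """Synthetic data (an assumption): n points each from N(-2, 1) and N(3, 1)."""
    rng = random.Random(seed)
    low = [rng.gauss(-2.0, 1.0) for _ in range(n)]
    high = [rng.gauss(3.0, 1.0) for _ in range(n)]
    return low + high
```

Calling run_em(make_data()) returns the fitted parameters together with the log-likelihood after each iteration. That sequence should never decrease, which is the convergence property the abstract says is retained under the extended projections.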

23 May, 2001, Wednesday
(11-12, RM M406)
Online Unsupervised Adaptation in Speaker Verification
-- Larry Heck, Nuance

Abstract
In this talk, I will present a new approach to on-line unsupervised adaptation in speaker verification. The approach extends previous work by (1) improving performance on the enrollment handset-type when adapting on a different handset-type (e.g., improving performance on cellular when adapting on a landline office phone), (2) accomplishing this cross-channel improvement without increasing the size of the speaker model after adaptation, (3) employing a count-based, parameter-dependent smoothing algorithm that emphasizes the use of mean parameters in the speaker models until sufficient adaptation data are present to accurately estimate variances, and (4) developing a new confidence-based adaptation update weight which minimizes the corrupting effects on the speaker models from impostor attacks. Experiments were conducted on a gender-balanced database of Japanese digits with 5222 speaker models across mixed channel conditions (landline and cellular). After adaptations on 8 separate phone calls with a single 8-digit utterance per call and a 12.5% impostor attack rate, the EER was reduced by 61% (rel.) using the new unsupervised adaptation approach. This compares favorably to the (optimal) 84% reduction in EER resulting from supervised adaptation.
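
The abstract does not give the smoothing formula, so the following Python sketch only illustrates one common form of count-based, parameter-dependent smoothing: a relevance-factor interpolation between the old parameter and a new estimate. The function names and the relevance values (MEAN_RELEVANCE, VARIANCE_RELEVANCE) are assumptions for illustration, not the method presented in the talk.

```python
# Hypothetical relevance factors: a larger value means more adaptation data is
# needed before the new estimate is trusted. Variances get the larger value,
# so the means adapt first.
MEAN_RELEVANCE = 4.0
VARIANCE_RELEVANCE = 64.0


def adaptation_weight(count, relevance):
    """Weight on the new estimate; grows from 0 toward 1 as the count grows."""
    return count / (count + relevance)


def adapt_parameter(old_value, new_estimate, count, relevance):
    """Interpolate between the old parameter and the estimate from new data."""
    alpha = adaptation_weight(count, relevance)
    return alpha * new_estimate + (1.0 - alpha) * old_value


def adapt_gaussian(old_mean, old_var, new_mean, new_var, count):
    """Adapt a single Gaussian's mean and variance with separate relevances."""
    mean = adapt_parameter(old_mean, new_mean, count, MEAN_RELEVANCE)
    var = adapt_parameter(old_var, new_var, count, VARIANCE_RELEVANCE)
    return mean, var
```

With these assumed values, a small count moves the mean most of the way toward the new estimate while leaving the variance close to its old value, which is the qualitative behavior item (3) of the abstract describes.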

30 May, 2001, Wednesday
(1-2, RM TBA)
Minimum Bayes-risk Automatic Speech Recognition
-- Bill Byrne, Johns Hopkins University

Abstract
Automatic speech recognition (ASR) systems are being deployed in diverse tasks such as human-to-machine dialogue, language acquisition by non-native speakers, indexing and retrieval of multi-lingual audio information, and even assistance to individuals with speech impairment. In observing the variety of uses to which ASR is put, the question arises whether a uniform ASR architecture is equally useful for all applications. It may be possible to improve application-specific performance of ASR systems by adopting a framework that allows construction of task-dependent recognizers. We will discuss the minimum Bayes-risk (MBR) classification framework as a means of building application-dependent recognizers. We provide experimental results showing that MBR recognizers can yield better word recognition accuracy than the commonly used maximum a posteriori probability (MAP) recognizer. Segmental MBR decoding procedures that 'chop up' the hypothesis spaces produced by ASR systems into manageably sized pieces for efficient search will also be described.
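
To make the contrast between MAP and MBR decoding concrete, here is a minimal Python sketch over a toy N-best list. It is an illustration under stated assumptions, not the system described in the talk: the hypotheses and posterior probabilities in NBEST are invented, the loss is taken to be word-level edit distance, and the function names are hypothetical. MAP picks the single most probable hypothesis, while MBR picks the hypothesis with the lowest expected loss against the whole list.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def map_decode(nbest):
    """Return the hypothesis with the highest posterior probability."""
    return max(nbest, key=lambda item: item[1])[0]


def expected_loss(candidate, nbest):
    """Posterior-weighted edit distance of a candidate against every hypothesis."""
    return sum(
        prob * edit_distance(other.split(), candidate.split())
        for other, prob in nbest
    )


def mbr_decode(nbest):
    """Return the hypothesis with the lowest expected loss under the posterior."""
    return min(nbest, key=lambda item: expected_loss(item[0], nbest))[0]


# Hypothetical N-best list of (hypothesis, posterior probability) pairs.
NBEST = [
    ("recognize speech", 0.40),
    ("wreck a nice beach", 0.35),
    ("wreck a nice peach", 0.25),
]
```

On this toy list, map_decode(NBEST) returns "recognize speech", while mbr_decode(NBEST) returns "wreck a nice beach". The two similar hypotheses together hold more probability mass than the MAP choice, so choosing one of them gives a lower expected word error.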