University of Washington
Department of Electrical Engineering
Speech Processing Seminar
Spring Quarter, 2001
1:00-2:00PM Wednesdays
RM EE1 026
New EE Bldg
University of Washington, Seattle
(unless otherwise noted)
26 April, 2001, Thursday (Special Seminar, 1:30-2:30pm, RM 003 EE/CSE Bldg)
Speaking of the Future - Towards Free Form Dialog with Machines
-- David Nahamoo, Human Language Technologies, IBM Research
Abstract
In the next decade, we will experience a major change in the way we
interact with machines. Through the convergence of the web and
wireless telephony, we will have access to services and applications
anywhere, anytime, and on any device. We will accomplish our tasks
through free form conversational dialog with services and applications
using multimodal interaction. In this talk, we will discuss the
current state of conversational technologies including speech
recognition, text-to-speech, dialog, natural language understanding,
speech biometrics, and multimodal interaction through technology
demonstrations and examine what can be expected from the technology in
the next few years.
2 May, 2001, Wednesday (1:00-2:00pm, RM 026 EE/CSE Bldg)
The Information Geometry of EM Variants for Speech and Image Processing
-- Asela Gunawardana, Johns Hopkins University
Abstract
The Expectation Maximization (EM) algorithm is an iterative technique
used in many applications such as speech recognition and medical
imaging, where it is employed to obtain statistical estimates from
incomplete observations of the variables of interest. We analyze the
EM algorithm using the information geometry of Csiszar and Tusnady, in
order to understand how it may be extended. In this framework, the EM
algorithm is viewed as the alternating minimization of the
Kullback-Leibler information divergence between a family of
statistical models and a desired family of distributions
defined by the observed data. Thus, an iteration of the EM algorithm
consists of a forward projection from the model family onto the
desired family, followed by a backward projection from the desired
family to the model family. The well-known GEM variant of the EM
algorithm corresponds to replacing the backward projection with a step
that reduces the divergence rather than minimizing it.
Our contribution lies in showing that the convergence properties of
the EM algorithm are retained when the forward projection is similarly
extended, and that the desired family and the divergence can each be
extended to yield useful estimation schemes. Such extensions of the
EM algorithm yield a proof of convergence for the incremental EM
algorithm of Neal and Hinton, and novel EM variants for estimation
from small amounts of data and for estimation in the presence of
outliers. The application of these EM variants to the problems of
hidden Markov model estimation for automatic speech recognition, as
well as to positron emission tomography, will be discussed.
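As an illustration of the alternating structure described in the abstract, here is a minimal sketch of standard EM for a two-component, one-dimensional Gaussian mixture with unit variances. It is not the speaker's extended algorithm: the function name `em_two_gaussians`, the toy data, and the initial values are assumptions made for this example. The E-step plays the role of the forward projection and the M-step that of the backward projection, and the log-likelihood recorded at each iteration should never decrease.

```python
import math


def em_two_gaussians(data, mu=(-1.0, 1.0), weight=0.5, iters=20):
    """Standard EM for a 2-component 1-D Gaussian mixture (unit variances).

    Hypothetical helper for illustration only. Returns the fitted means,
    the weight of component 1, and the log-likelihood at each iteration.
    """
    mu0, mu1 = mu
    log_likelihoods = []
    for _ in range(iters):
        # E-step (forward projection): posterior probability that each
        # point came from component 1, under the current parameters.
        resp = []
        ll = 0.0
        for x in data:
            p0 = (1.0 - weight) * math.exp(-0.5 * (x - mu0) ** 2)
            p1 = weight * math.exp(-0.5 * (x - mu1) ** 2)
            resp.append(p1 / (p0 + p1))
            ll += math.log((p0 + p1) / math.sqrt(2.0 * math.pi))
        log_likelihoods.append(ll)

        # M-step (backward projection): re-estimate the parameters from
        # the posterior-weighted data.
        n1 = sum(resp)
        n0 = len(data) - n1
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        weight = n1 / len(data)
    return (mu0, mu1), weight, log_likelihoods


# Toy data with two well-separated clusters (assumed values).
means, mix_weight, lls = em_two_gaussians([-2.1, -1.9, -2.0, 1.8, 2.2, 2.0])
```

A GEM-style variant would replace the M-step with any update that merely increases the expected complete-data log-likelihood instead of maximizing it.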
23 May 2001, Wednesday (11-12, RM M406)
Online Unsupervised Adaptation in Speaker Verification
-- Larry Heck, Nuance
Abstract
In this talk, I will present a new approach to on-line unsupervised
adaptation in speaker verification. The approach extends previous
work by (1) improving performance on the enrollment handset-type when
adapting on a different handset-type (e.g., improving performance on
cellular when adapting on a landline office phone), (2) accomplishing
this cross channel improvement without increasing the size of the
speaker model after adaptation, (3) employing a count-based,
parameter-dependent smoothing algorithm that emphasizes the use of
mean parameters in the speaker models until sufficient adaptation data
are present to accurately estimate variances, and (4) developing a new
confidence-based adaptation update weight which minimizes the
corrupting effects on the speaker models from impostor attacks.
Experiments were conducted on a gender-balanced database of
Japanese digits with 5222 speaker models across mixed channel
conditions (landline and cellular). After adaptations on 8 separate
phone calls with a single 8-digit utterance per call and a 12.5%
impostor attack rate, the EER was reduced by 61% (rel.) using the new
unsupervised adaptation approach. This compares favorably to the
(optimal) 84% reduction in EER resulting from supervised adaptation.
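The abstract does not give the smoothing formula, so the following is only a sketch of one common count-based scheme, relevance-factor interpolation, and should not be read as the speaker's algorithm. The function names, the relevance values `r_mean` and `r_var`, and the direct interpolation of variances are all assumptions. A larger relevance factor for variances keeps them close to the prior model until many adaptation frames have been seen, while means move sooner.

```python
def smoothing_weight(count, relevance):
    """Count-based weight in [0, 1): grows toward 1 as count increases."""
    return count / (count + relevance)


def adapt_parameters(prior_mean, prior_var, data_mean, data_var, count,
                     r_mean=4.0, r_var=64.0):
    """Interpolate prior and adaptation-data statistics (illustrative only).

    r_mean and r_var are assumed relevance factors; r_var > r_mean means
    variances need more data than means before they are trusted.
    """
    a_mean = smoothing_weight(count, r_mean)
    a_var = smoothing_weight(count, r_var)
    new_mean = a_mean * data_mean + (1.0 - a_mean) * prior_mean
    new_var = a_var * data_var + (1.0 - a_var) * prior_var
    return new_mean, new_var


# With only 4 frames, the mean moves halfway to the data estimate while
# the variance stays near its prior value.
new_mean, new_var = adapt_parameters(0.0, 1.0, 2.0, 3.0, count=4)
```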
30 May 2001, Wednesday (1-2, RM TBA)
Minimum Bayes-risk Automatic Speech Recognition
-- Bill Byrne, Johns Hopkins University
Abstract
Automatic speech recognition (ASR) systems are being deployed in diverse tasks
such as human-to-machine dialogue, language acquisition by non-native speakers,
indexing and retrieval of multi-lingual audio information, and even assistance
to individuals with speech impairment. In observing the variety of uses to
which ASR is put, the question arises whether a uniform ASR architecture is
equally useful for all applications. It may be possible to improve the
application-specific performance of ASR systems by adopting a framework that
allows construction of task-dependent recognizers. We will discuss the minimum
Bayes-risk (MBR) classification framework as a means of building
application-dependent recognizers. We provide experimental results showing that
MBR recognizers can yield better word recognition accuracy than the commonly
used maximum a posteriori probability (MAP) recognizer. Segmental MBR decoding
procedures which 'chop up' the hypothesis spaces produced by ASR systems into
manageably sized pieces for efficient search will also be described.
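To make the contrast between MAP and MBR decoding concrete, here is a minimal sketch over a toy N-best list, assuming word-level edit distance as the loss. The hypotheses, their posterior probabilities, and the function names are invented for illustration; this is whole-utterance MBR, not the segmental procedure described in the talk.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def map_decode(nbest):
    """Pick the hypothesis with the highest posterior probability."""
    return max(nbest, key=lambda item: item[1])[0]


def mbr_decode(nbest):
    """Pick the hypothesis with the lowest expected word-level loss."""
    def expected_loss(candidate):
        return sum(prob * edit_distance(other.split(), candidate.split())
                   for other, prob in nbest)
    return min((hyp for hyp, _ in nbest), key=expected_loss)


# Toy N-best list: (hypothesis, posterior probability), assumed values.
nbest = [("x y z", 0.40), ("a b c", 0.35), ("a b d", 0.25)]
map_choice = map_decode(nbest)
mbr_choice = mbr_decode(nbest)
```

Here the single most probable hypothesis shares no words with the other two, which together carry most of the probability mass, so the MBR choice differs from the MAP choice.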