University of Washington
Department of Electrical Engineering
Speech and Signal Processing Seminar
Winter Quarter, 2003
RM EE1-303/403
New EE Bldg
University of Washington, Seattle
Fridays, 1-2PM, in RM EE1-303 (unless otherwise noted)
Monday, 10 March 2003 (EE1 403, 2:30-3:30PM)
DBN Based Modeling Methodology and its Applications in Speech
-- Dr. Yimin Zhang, Senior Researcher and
Manager, Statistical Computing Group
Intel China Research Center, Intel Research Labs
Abstract
DBN (Dynamic Bayesian Network) models have been extensively used for
representing speech in recent years. For example, coupled HMMs for
audio-visual speech recognition are a special case of DBNs. This talk
will first give an introduction to BNs/DBNs and survey recent advances
in DBN-based speech modeling. We will then introduce our research in
designing sophisticated speech models, such as
synchronous/asynchronous multi-stream models, weighted multi-stream
models, AVSR in explicit DBN modeling, and LM speed-up tricks (such as
lexical trees) in DBN models that are important for LVCSR. This talk
will also introduce our algorithmic research on efficient DBN Viterbi
decoding. Finally, a brief introduction to Intel's research directions
in probabilistic computing will be given. Some DBN toolkits will also
be introduced.
We will show that DBNs can be seen as a graphical programming language
that can represent almost everything from acoustic models to advanced
language models, which makes them especially suitable as a tool for
designing unified models of speech that may help solve hard
problems such as noise robustness and spontaneous speech recognition.
This talk also uses extensive examples to explain some advanced
modeling techniques, such as deterministic nodes, sparse CPDs, and
hierarchical modeling, in order to illustrate the issues and goals
that are indispensable in designing innovative, sophisticated, and
tractable models. In addition to researchers and students in speech
recognition and language understanding, scientists from other fields,
such as computer vision and bioinformatics, who are interested in DBN
modeling are also expected to benefit from this talk.
9 January 2003 (EE1 403, 1PM)
Toward Adaptive Conversational Interfaces:
Modeling Speech Convergence with Animated Personas
-- Prof. Sharon Oviatt, Center for Human Computer Communication
Department of Computer Science, Oregon Health & Science University
Abstract
The design of robust interfaces that process conversational speech is a
challenging research direction largely because users' spoken language is so
variable. This research explores a new dimension of speaker stylistic
variation by examining whether users' speech converges systematically with
the text-to-speech (TTS) heard from a software partner. To pursue this
question, a study was conducted in which twenty-four 7-to-10-year-old
children conversed with animated partners that embodied different TTS
voices. An analysis of children's amplitude, durational features, and
dialogue response latencies confirmed that they spontaneously adapt several
basic acoustic-prosodic features of their speech by 10-50%, with the largest
adaptations involving utterance pause structure and amplitude. Children's
speech adaptations were relatively rapid, bidirectional, and dynamically
readaptable when introduced to new partners, and generalized across
different types of users and TTS voices. Adaptations also occurred
consistently, with 70-95% of children converging with their partner's TTS,
although individual differences in magnitude of adaptation were evident. In
the design of future conversational systems, users' spontaneous convergence
could be exploited to guide their speech within system processing bounds,
thereby enhancing robustness. Adaptive system processing could yield
further significant performance gains. The long-term goal of this research
is the development of predictive models of human-computer communication to
guide the design of new conversational interfaces.
17 January 2003
Pmake in SSLI Lab
-- Prof. Jeff Bilmes
24 January 2003
Multi-Band LSF Representation of Speech for Robust Speech Recognition
-- Prof. Bishnu Atal
Abstract
As automatic speech recognition (ASR) systems are being deployed, the issue of
robust performance is becoming increasingly important. The performance of most
ASR systems degrades significantly when the system is tested with a microphone
or in an acoustic environment that is different from the one in which the system
was trained. The acoustic front-end representation widely used in current ASR
systems is based on mel-frequency cepstral coefficients derived from the
short-time power spectrum of speech. Both the linear filtering introduced by
different microphones and additive noise impact the short-time power spectrum
and therefore the cepstral coefficients.
In this talk, I describe an acoustic representation of speech, in which the
signal is divided into 16 frequency bands, and each band is represented by two
LSF (Line Spectrum Frequencies) parameters at intervals of 25 ms, resulting in
a sequence of 32-dimensional vectors. The 32-dimensional vectors in five
adjacent 25 ms time intervals are joined together to create a vector in a new
160-dimensional space. For each of the phonemes (obtained from a database,
such as TIMIT), a linear transformation is used to convert 160-dimensional
vectors into orthogonal vectors such that different occurrences of each
phoneme in the database are points in a 160-dimensional hyper-spherical space.
An utterance of an unknown phoneme is recognized by comparing Euclidean
distances of the point corresponding to the unknown phoneme in the
160-dimensional hypersphere from the centers of hyperspheres corresponding to
different phonemes and selecting the one with minimum distance.
This representation is robust in the presence of distortions introduced by
linear filtering or additive noise. The performance of this representation for
phoneme recognition in the absence of distortions is comparable to that of
cepstral parameters. Since the LSF parameters are computed from the normalized
autocorrelation function of the signal in each of the 16 frequency bands, they
do not include information about the energy of the signal in any of the
frequency bands. But these parameters contain enough phonetic information to
provide the same phone recognition performance as the cepstral coefficients
for clean speech. Furthermore, their performance is far superior to cepstral
coefficients when there is a mismatch between the training and testing
conditions due to distortions introduced by linear filtering or additive noise.
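
The stacking and minimum-distance steps described in this abstract can be
sketched as follows. This is only a minimal illustration under stated
assumptions: the function names are hypothetical, the five-frame window is
assumed to slide one frame at a time (the abstract does not say), and the
per-phoneme linear transformation is omitted, so distances are measured
directly in the stacked 160-dimensional space.

```python
import math

def stack_frames(frames, context=5):
    """Join each run of `context` adjacent frames into one long vector.

    With 32-dimensional frames and context=5 this yields 160-dimensional
    vectors. The one-frame step between windows is an assumption.
    """
    stacked = []
    for start in range(len(frames) - context + 1):
        vec = []
        for frame in frames[start:start + context]:
            vec.extend(frame)
        stacked.append(vec)
    return stacked

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(point, centers):
    """Return the label whose center is closest to `point`."""
    return min(centers, key=lambda label: euclidean(point, centers[label]))

# Toy data: 7 frames of 32 values each (2 LSFs x 16 bands per frame).
frames = [[float(i)] * 32 for i in range(7)]
stacked = stack_frames(frames)

# Hypothetical phoneme centers in the 160-dimensional space.
centers = {"aa": [0.0] * 160, "iy": [10.0] * 160}
label = nearest_center(stacked[0], centers)
```

In the method described by the speaker, each phoneme additionally has its own
linear transformation applied before the distance comparison; that step would
slot in just before the call to nearest_center.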
31 January 2003
Inequalities between Uncertainty Measures and Error Probability
-- Ozgur Cetin
Abstract
The relationships between the relative entropy of a discrete random
variable and the probability of error in guessing its value from
another random variable will be examined. We will derive lower
and upper bounds relating entropy to minimum probability of
error. Particular attention will be given to Renyi's entropy
whose definition and properties will be reviewed. Implications
for discriminative parameter estimation algorithms will be mentioned.
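
As a concrete illustration of bounds of this kind, the sketch below checks two
standard inequalities on a small joint distribution: Fano's inequality,
H(X|Y) <= h(Pe) + Pe*log2(M-1), and the upper bound Pe <= H(X|Y)/2 (in bits)
usually attributed to Hellman and Raviv. Whether the talk covers these
particular bounds is an assumption; the abstract does not name them. They are
stated here for conditional Shannon entropy, and the joint distribution is
made up for the example.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) in bits, where joint[x][y] = P(X=x, Y=y)."""
    num_y = len(joint[0])
    h = 0.0
    for y in range(num_y):
        p_y = sum(row[y] for row in joint)
        for row in joint:
            if row[y] > 0:
                h -= row[y] * math.log2(row[y] / p_y)
    return h

def bayes_error(joint):
    """Minimum probability of error when guessing X from Y (MAP rule)."""
    num_y = len(joint[0])
    return 1.0 - sum(max(row[y] for row in joint) for y in range(num_y))

def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_rhs(p_err, num_classes):
    """Right-hand side of Fano's inequality (requires num_classes >= 2)."""
    return binary_entropy(p_err) + p_err * math.log2(num_classes - 1)

# Hypothetical joint distribution: 3 values of X (rows), 2 values of Y.
joint = [
    [0.30, 0.05],
    [0.10, 0.25],
    [0.10, 0.20],
]
h_x_given_y = conditional_entropy(joint)
p_err = bayes_error(joint)

fano_holds = h_x_given_y <= fano_rhs(p_err, num_classes=3)
upper_bound_holds = p_err <= 0.5 * h_x_given_y
```

Fano's inequality gives a lower bound on the error probability for a given
conditional entropy, while the second inequality bounds it from above.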
7 February 2003
no meeting
14 February 2003
no meeting
20 February 2003 (EE1 403, 1PM)
Progress & Challenges in Converged Communication
- A multimodal/multimedia communication and interaction perspective
-- Wu Chou
Avaya Labs Research
Abstract
The convergence of communication, the convergence of communication
infrastructure and the convergence of communication services have led
to a new paradigm of seamless communication over various network
bearers, modalities, and media. In this talk, we will focus on some
current approaches and technical challenges in media processing and
dialogue interaction that advance the traditional interactive voice
response into seamless interactive multimodal response over the
converged communication infrastructure of PSTN, Wireless, Web, and
VoIP.
21 February 2003
Enhancing N-gram Language Models with Text Data from the Web
-- Ivan Bulyko
Abstract
Language models constitute one of the key components in modern speech
recognition systems. Training an N-gram language model, the most
commonly used type of model, requires large quantities of text that is
matched to the target recognition task both in terms of style and topic.
In tasks involving conversational speech the ideal training material
(i.e. transcripts of conversational speech) is costly to produce, which
limits the amount of training data currently available.
In this work we extract additional training data from the web, searching
for text that matches the two tasks under consideration: Switchboard and
Meetings. We then use class-dependent interpolation to handle source
mismatch when combining different training corpora. Recognition
experiments show a significant reduction in WER (0.7-1.8% absolute) due
to both additional training data and class-based interpolation.
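
The idea of class-dependent interpolation can be sketched as a mixture of
component language models whose weights depend on a class of the history. The
abstract does not give the exact formulation, so the form below is an
assumption: bigram components, weights indexed by the class of the previous
word, and made-up probabilities and class labels. All names are hypothetical.

```python
def make_bigram(table):
    """Wrap a {(history, word): prob} table as a function P(word | history)."""
    def prob(word, history):
        return table.get((history, word), 0.0)
    return prob

def interpolate(word, history, models, weights, word_class):
    """P(word | history) = sum_s lambda_s(class(history)) * P_s(word | history).

    `weights` maps a class label to one weight per component model;
    the weights for each class are expected to sum to 1.
    """
    lambdas = weights[word_class(history)]
    return sum(lam * model(word, history) for lam, model in zip(lambdas, models))

# Toy component models over a two-word vocabulary (numbers are made up).
in_domain = make_bigram({("uh", "yeah"): 0.7, ("uh", "the"): 0.3})
web_text = make_bigram({("uh", "yeah"): 0.2, ("uh", "the"): 0.8})

# Hypothetical classes: trust the in-domain model more after filler words.
weights = {"filler": (0.9, 0.1), "other": (0.5, 0.5)}

def word_class(word):
    return "filler" if word in {"uh", "um"} else "other"

p_yeah = interpolate("yeah", "uh", [in_domain, web_text], weights, word_class)
p_the = interpolate("the", "uh", [in_domain, web_text], weights, word_class)
```

Because each class has its own weight vector, a web corpus that matches the
target task poorly for some contexts can be down-weighted in exactly those
contexts instead of globally.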
28 February 2003
no meeting
7 March 2003
Intransitive Classifiers
-- Gang Ji
Abstract
In any pattern classification task, errors are introduced because of
the difference between the true generative model and the one obtained
via model estimation. One approach to solving this problem uses more
training data and more accurate (but often more complicated) models.
We instead introduce an information-theoretic correction term to the
likelihood ratio classification method for multiple classes, which
tries to compensate (post log-likelihood ratio) for the difference
between the true and estimated model scores. This term makes the class comparisons
intransitive and we use several tournament-like strategies to deal
with this issue. We test a number of new schemes on an isolated-word
automatic speech recognition task as well as UCI machine learning data
sets. Results on isolated-word recognition as well as UCI data sets
show that by using the bias terms calculated this way, the accuracy of
classification substantially improves over the baseline.
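
To see why a pairwise correction term can make comparisons intransitive, and
how a tournament can still produce a decision, consider the sketch below. It
is an illustration only: the abstract gives neither the formula for the
correction term nor the specific tournament strategies, so the correction
values are made-up numbers and the round-robin rule with a log-likelihood
tie-break is an assumed example of a "tournament-like" strategy.

```python
def pairwise_winner(i, j, loglik, correction):
    """Class i beats class j if the corrected log-likelihood ratio is positive."""
    score = loglik[i] - loglik[j] + correction[i][j]
    return i if score > 0 else j

def round_robin(loglik, correction):
    """Play every pair once; return the class with the most wins.

    Ties in the number of wins are broken by the uncorrected log-likelihood.
    """
    num_classes = len(loglik)
    wins = [0] * num_classes
    for i in range(num_classes):
        for j in range(i + 1, num_classes):
            wins[pairwise_winner(i, j, loglik, correction)] += 1
    return max(range(num_classes), key=lambda c: (wins[c], loglik[c]))

# Hypothetical log-likelihoods for three classes.
loglik = [-10.0, -10.5, -11.0]

# Hypothetical antisymmetric correction terms, correction[i][j] = -correction[j][i].
correction = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

# With these numbers the comparisons form a cycle:
# class 1 beats 0, class 0 beats 2, and class 2 beats 1.
cycle = (
    pairwise_winner(0, 1, loglik, correction),
    pairwise_winner(0, 2, loglik, correction),
    pairwise_winner(1, 2, loglik, correction),
)
decision = round_robin(loglik, correction)
```

With all correction terms set to zero, the pairwise comparisons are transitive
and the round-robin winner is simply the class with the highest likelihood.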
14 March 2003 (11AM)
TBD
-- Chia-Ping Chen
14 March 2003 (1PM)
Using Speech Quality Measures in Selective Sampling for ASR Training
-- Stephen Juranich
Last updated ($Date: 2003/04/10 00:40:40 $)