University of Washington
Department of Electrical Engineering

Speech and Signal Processing Seminar

Winter Quarter, 2003
RM EE1-303/403 New EE Bldg
University of Washington, Seattle
Fridays, 1-2 PM, RM EE1-303 (unless otherwise noted)

Monday, 10 March 2003 (EE1 403, 2:30-3:30PM)
DBN Based Modeling Methodology and its Applications in Speech
-- Dr. Yimin Zhang, Senior Researcher and Manager, Statistical Computing Group
Intel China Research Center, Intel Research Labs

Abstract
DBN (Dynamic Bayesian Network) models have been used extensively for representing speech in recent years. For example, coupled HMMs for audio-visual speech recognition are a special case of DBNs. This talk will first give an introduction to BNs/DBNs and survey recent advances in DBN-based speech modeling. We will then introduce our research in designing sophisticated speech models, such as synchronous/asynchronous multi-stream models, weighted multi-stream models, AVSR in explicit DBN modeling, and LM speed-up tricks (such as lexical trees) in DBN models, which are important for LVCSR. The talk will also introduce our algorithmic research on efficient DBN Viterbi decoding. Finally, a brief introduction to Intel's research directions in probabilistic computing will be given, and some DBN toolkits will be introduced. We will show that DBNs can be seen as a graphical programming language that can represent almost everything from acoustic models to advanced language models, which makes them especially suitable as a tool for designing unified models for speech that may potentially solve hard problems such as noise robustness and spontaneous speech recognition. The talk also aims to use extensive examples to explain some advanced modeling techniques, such as deterministic nodes, sparse CPDs, and hierarchical modeling, in order to illustrate the issues and goals that are indispensable in designing innovative, sophisticated, and tractable models. In addition to researchers and students in speech recognition and language understanding, scientists from other fields, such as computer vision and bioinformatics, who are interested in DBN modeling are also expected to benefit from this talk.

9 January 2003 (EE1 403, 1PM)
Toward Adaptive Conversational Interfaces: Modeling Speech Convergence with Animated Personas
-- Prof. Sharon Oviatt, Center for Human Computer Communication
Department of Computer Science, Oregon Health & Science University

Abstract
The design of robust interfaces that process conversational speech is a challenging research direction largely because users' spoken language is so variable. This research explores a new dimension of speaker stylistic variation by examining whether users' speech converges systematically with the text-to-speech (TTS) heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7-to-10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children's amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech 10-50%, with the largest adaptations involving utterance pause structure and amplitude. Children's speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when introduced to new partners, and generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70-95% of children converging with their partner's TTS, although individual differences in magnitude of adaptation were evident. In the design of future conversational systems, users' spontaneous convergence could be exploited to guide their speech within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new conversational interfaces.

17 January 2003
Pmake in SSLI Lab
-- Prof. Jeff Bilmes

24 January 2003
Multi-Band LSF Representation of Speech for Robust Speech Recognition
-- Prof. Bishnu Atal

Abstract
As automatic speech recognition (ASR) systems are being deployed, the issue of robust performance is becoming increasingly important. The performance of most ASR systems degrades significantly when the system is tested with a microphone or in an acoustic environment that is different from the one in which the system was trained. The acoustic front-end representation used widely in current ASR systems is based on mel-frequency cepstral coefficients derived from the short-time power spectrum of speech. Both the linear filtering introduced by different microphones and additive noise impact the short-time power spectrum and therefore the cepstral coefficients. In this talk, I describe an acoustic representation of speech in which the signal is divided into 16 frequency bands, and each band is represented by two LSF (Line Spectrum Frequencies) parameters at intervals of 25 ms, resulting in a sequence of 32-dimensional vectors. The 32-dimensional vectors in five adjacent 25 ms time intervals are joined together to create a vector in a new 160-dimensional space. For each of the phonemes (obtained from a database, such as TIMIT), a linear transformation is used to convert 160-dimensional vectors into orthogonal vectors such that different occurrences of each phoneme in the database are points in a 160-dimensional hyper-spherical space. An utterance of an unknown phoneme is recognized by comparing Euclidean distances of the point corresponding to the unknown phoneme in the 160-dimensional hypersphere from the centers of hyperspheres corresponding to different phonemes and selecting the one with minimum distance. This representation is robust in the presence of distortions introduced by linear filtering or additive noise. The performance of this representation for phoneme recognition in the absence of distortions is comparable to that of cepstral parameters.
Since the LSF parameters are computed from the normalized autocorrelation function of the signal in each of the 16 frequency bands, they do not include information about the energy of the signal in any of the frequency bands. But these parameters contain enough phonetic information to provide the same phone recognition performance as the cepstral coefficients for clean speech. Furthermore, their performance is far superior to cepstral coefficients when there is a mismatch between the training and testing conditions due to distortions introduced by linear filtering or additive noise.
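The dimension bookkeeping and the minimum-distance decision described in this abstract can be illustrated with a minimal sketch. It assumes the LSF parameters have already been computed, and it omits the per-phoneme linear transformation; the function names, the toy frames, and the two phoneme centers are hypothetical and are not taken from the talk.

```python
import math

BANDS = 16            # frequency bands (from the abstract)
LSF_PER_BAND = 2      # LSF parameters per band (from the abstract)
CONTEXT = 5           # adjacent 25 ms intervals joined together (from the abstract)
FRAME_DIM = BANDS * LSF_PER_BAND     # 32-dimensional frame vector
STACKED_DIM = FRAME_DIM * CONTEXT    # 160-dimensional stacked vector


def stack_frames(frames, start):
    """Join CONTEXT consecutive 32-dimensional frames into one 160-dimensional vector."""
    window = frames[start:start + CONTEXT]
    if len(window) != CONTEXT or any(len(f) != FRAME_DIM for f in window):
        raise ValueError("need five consecutive 32-dimensional frames")
    return [value for frame in window for value in frame]


def nearest_center(vector, centers):
    """Return the label whose center is closest to vector in Euclidean distance."""
    return min(centers, key=lambda label: math.dist(vector, centers[label]))


# Toy data (hypothetical): six constant frames and two made-up phoneme centers.
frames = [[0.1 * t] * FRAME_DIM for t in range(6)]
stacked = stack_frames(frames, 1)
centers = {"aa": [0.3] * STACKED_DIM, "iy": [0.9] * STACKED_DIM}
decision = nearest_center(stacked, centers)
```

In this toy setup the stacked vector lies closer to the "aa" center, so that label is selected; real centers would be estimated from labeled data such as TIMIT.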

31 January 2003
Inequalities between Uncertainty Measures and Error Probability
-- Ozgur Cetin

Abstract
The relationships between relative entropy of a discrete random variable and probability of error in guessing its value from another random variable will be examined. We will derive lower and upper bounds relating entropy to minimum probability of error. Particular attention will be given to Renyi's entropy whose definition and properties will be reviewed. Implications for discriminative parameter estimation algorithms will be mentioned.
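As a hedged illustration of the kind of bound discussed here, the sketch below checks two well-known inequalities on a toy joint distribution: Fano's inequality, H(X|Y) <= h(Pe) + Pe log2(M - 1), and the upper bound Pe <= H(X|Y)/2 with entropy in bits. Whether the talk covers exactly these bounds is an assumption; the joint distribution and function names are made up for the example.

```python
import math


def conditional_entropy_bits(joint):
    """H(X|Y) in bits, where joint[y][x] = P(X=x, Y=y)."""
    h = 0.0
    for row in joint:
        p_y = sum(row)
        for p_xy in row:
            if p_xy > 0:
                h -= p_xy * math.log2(p_xy / p_y)
    return h


def bayes_error(joint):
    """Minimum probability of error: guess the most probable x for each y."""
    return 1.0 - sum(max(row) for row in joint)


def fano_rhs_bits(p_e, num_values):
    """Right-hand side of Fano's inequality: h(Pe) + Pe * log2(M - 1)."""
    h_b = 0.0
    for p in (p_e, 1.0 - p_e):
        if p > 0:
            h_b -= p * math.log2(p)
    extra = p_e * math.log2(num_values - 1) if num_values > 2 else 0.0
    return h_b + extra


# Hypothetical joint distribution: Y has 2 values (rows), X has 3 values (columns).
joint = [[0.30, 0.10, 0.10],
         [0.05, 0.05, 0.40]]
h_cond = conditional_entropy_bits(joint)
p_e = bayes_error(joint)
fano_holds = h_cond <= fano_rhs_bits(p_e, 3)
upper_holds = p_e <= h_cond / 2
```

Fano's inequality limits how small the error can be for a given conditional entropy, while the second inequality limits how large the minimum error can be.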

7 February 2003
no meeting

14 February 2003
no meeting

20 February 2003 (EE1 403, 1PM)
Progress & Challenges in Converged Communication - A multimodal/multimedia communication and interaction perspective
-- Wu Chou
Avaya Labs Research

Abstract
The convergence of communication, the convergence of communication infrastructure, and the convergence of communication services have led to a new paradigm of seamless communication over various network bearers, modalities, and media. In this talk, we will focus on some current approaches and technical challenges in media processing and dialogue interaction that advance traditional interactive voice response into seamless interactive multimodal response over the converged communication infrastructure of PSTN, Wireless, Web, and VoIP.

21 February 2003
Enhancing N-gram Language Models with Text Data from the Web
-- Ivan Bulyko

Abstract
Language models constitute one of the key components in modern speech recognition systems. Training an N-gram language model, the most commonly used type of model, requires large quantities of text that is matched to the target recognition task both in terms of style and topic. In tasks involving conversational speech the ideal training material (i.e. transcripts of conversational speech) is costly to produce, which limits the amount of training data currently available. In this work we extract additional training data from the web, searching for text that matches the two tasks under consideration: Switchboard and Meetings. We then use class-dependent interpolation to handle source mismatch when combining different training corpora. Recognition experiments show a significant reduction in WER (0.7-1.8% absolute) due to both additional training data and class-based interpolation.
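A minimal sketch of what class-dependent interpolation might look like: each source model contributes to the mixture with a weight that depends on the class of the preceding word. The abstract does not define the classes or the weights, so the two-class split, the weight values, and the toy bigram tables below are all assumptions made for illustration.

```python
def make_bigram(table):
    """Wrap a {(previous_word, word): probability} table as a model function."""
    def prob(word, history):
        return table.get((history[-1], word), 0.0)
    return prob


def interpolate(word, history, models, weights, word_class):
    """P(word | history) as a mixture whose weights depend on the history's class."""
    lambdas = weights[word_class(history[-1])]
    return sum(lam * model(word, history) for lam, model in zip(lambdas, models))


# Hypothetical source models: in-domain transcripts and web text.
in_domain = make_bigram({("uh", "yeah"): 0.5, ("the", "meeting"): 0.2})
web_text = make_bigram({("uh", "yeah"): 0.1, ("the", "meeting"): 0.4})


def word_class(word):
    """Hypothetical two-class split: conversational fillers versus everything else."""
    return "filler" if word in {"uh", "um"} else "content"


# Assumed weights (in-domain, web) per class; each pair sums to 1.
weights = {"filler": (0.9, 0.1), "content": (0.5, 0.5)}
models = (in_domain, web_text)

p_filler = interpolate("yeah", ["uh"], models, weights, word_class)
p_content = interpolate("meeting", ["the"], models, weights, word_class)
```

The idea behind the assumed weights is that web text is less reliable after conversational fillers, so the in-domain model dominates there, while content-word contexts lean more on the larger web corpus.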

28 February 2003
no meeting

7 March 2003
Intransitive Classifiers
-- Gang Ji

Abstract
In any pattern classification task, errors are introduced because of the difference between the true generative model and the one obtained via model estimation. One approach to solving this problem uses more training data and more accurate (but often more complicated) models. We introduce an information-theoretic correction term to the likelihood-ratio classification method for multiple classes that tries to compensate (post log-likelihood ratio) for the difference between the true and estimated model scores. This term makes the class comparisons intransitive, and we use several tournament-like strategies to deal with this issue. We test a number of new schemes on an isolated-word automatic speech recognition task as well as UCI machine learning data sets. Results on both show that, with the bias terms calculated this way, classification accuracy improves substantially over the baseline.
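To see why a pairwise correction term can make comparisons intransitive, consider the following sketch. The correction values are invented for illustration and are not the information-theoretic term from the talk; the knockout procedure is one simple tournament-like strategy, assumed here rather than taken from the abstract.

```python
def correction(a, b, bias):
    """Antisymmetric pairwise correction: bias[(a, b)] == -bias[(b, a)]."""
    if (a, b) in bias:
        return bias[(a, b)]
    return -bias.get((b, a), 0.0)


def pairwise_winner(a, b, loglik, bias):
    """Compare two classes with a corrected log-likelihood ratio."""
    score = loglik[a] - loglik[b] + correction(a, b, bias)
    return a if score >= 0 else b


def knockout(classes, loglik, bias):
    """Single pass: the current champion meets each remaining class in turn."""
    champion = classes[0]
    for challenger in classes[1:]:
        champion = pairwise_winner(champion, challenger, loglik, bias)
    return champion


# Hypothetical scores: A > B > C by likelihood, but the correction favors C over A.
loglik = {"A": -10.0, "B": -10.5, "C": -11.0}
bias = {("C", "A"): 1.5}

a_beats_b = pairwise_winner("A", "B", loglik, bias)
b_beats_c = pairwise_winner("B", "C", loglik, bias)
c_beats_a = pairwise_winner("A", "C", loglik, bias)
result_abc = knockout(["A", "B", "C"], loglik, bias)
result_bca = knockout(["B", "C", "A"], loglik, bias)
```

Here A beats B, B beats C, and C beats A, so no class wins every comparison and the knockout result depends on the order in which classes meet. This is the situation a tournament strategy has to resolve.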

14 March 2003 (11AM)
TBD
-- Chia-Ping Chen

14 March 2003 (1PM)
Using Speech Quality Measures in Selective Sampling for ASR Training
-- Stephen Juranich

Past Quarter's Seminars


Last updated ($Date: 2003/04/10 00:40:40 $)