University of Washington
Department of Electrical Engineering

SSLI-LAB: Speech and Language Processing Seminar

Summer Quarter, 2003
RM EE1-303/403 New EE Bldg
University of Washington, Seattle

Wednesday, 10 September 2003 (EE1 303, 3-4PM)
Hidden Feature Modeling for Speech Recognition Using Dynamic Bayesian Networks
-- Karen Livescu
MIT

Abstract
The majority of current approaches to automatic speech recognition (ASR) use the phoneme or phone as the basic linguistic unit. Recently, however, there have been growing doubts about this choice of unit, and a number of research efforts have been aimed at either replacing the phone or supplementing it with multiple streams of articulatory or other linguistic features. We refer to these types of models as hidden feature models, since the features in question are hidden from the listener (as opposed to acoustic features such as cepstral coefficients, which are directly measured from the signal). In our work, we use the framework of graphical models, and in particular dynamic Bayesian networks (DBNs), to represent hidden feature models. Graphical models are a natural choice because they allow for the explicit representation of dependencies between multiple streams of variables, and because there are standard algorithms for performing maximum-likelihood parameter estimation and decoding for large classes of models. This talk will present one class of DBN-based hidden feature models that we have investigated. We will discuss the issues involved in designing the model and training the parameters. We will present a factored model of the acoustic observation probability that we have used to alleviate the inherent sparse data problems, as well as initial experiments on a continuous digit recognition task. Finally, we will describe ongoing and future extensions of our work.

Monday, 30 June 2003 (EE1 303, 10-11AM)
Robust Viterbi Algorithm against Impulsive Noise
-- Manhung Siu
Hong Kong University of Science and Technology

Abstract
The Viterbi algorithm has been successfully applied in a range of pattern recognition and communication tasks. However, if parts of the observation sequence are corrupted by impulsive noise and this noise is not accounted for by the distortion measures, performance can degrade significantly. In this talk, I will describe our proposed modification to the Viterbi algorithm that allows it to handle short bursts of impulsive noise. We call this the "Robust Viterbi Algorithm". The underlying principle is to detect corrupted observations together with the Viterbi search, in effect making a joint decision about which observations are corrupted and which path is best. To make the algorithm applicable to environments with different amounts of impulsive noise, we also introduce an efficient approach for estimating the number of corruptions based on a likelihood ratio. The effectiveness of the algorithm is demonstrated on speech recognition problems. Experiments show that an error reduction of more than 70% can be achieved relative to the standard Viterbi algorithm in a Gaussian replacement noise environment. Beyond speech recognition, I will also briefly describe how the approach can be applied to channel coding over an impulsive noise channel.
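
The joint-decision idea in the abstract can be illustrated with a small Python sketch. It is not the speaker's algorithm, only one plausible reading of it under stated assumptions: the hidden Markov model is discrete, the number of corrupted frames allowed (`max_corrupt`) is given rather than estimated by a likelihood ratio, and a frame marked as corrupted contributes a fixed, state-independent log score (`corrupt_logp`) in place of its emission score. The function name and all parameter names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def robust_viterbi(obs, log_init, log_trans, log_emit, max_corrupt, corrupt_logp=0.0):
    """Viterbi search that may mark up to max_corrupt frames as corrupted.

    Illustrative sketch only. obs must be a non-empty list of symbol indices.
    Returns (best_log_score, state_path, flagged_frames).
    """
    n_states = len(log_init)
    # score[k][s]: best log score of a path ending in state s with k frames flagged.
    score = [[NEG_INF] * n_states for _ in range(max_corrupt + 1)]
    for s in range(n_states):
        score[0][s] = log_init[s] + log_emit[s][obs[0]]
        if max_corrupt >= 1:
            score[1][s] = log_init[s] + corrupt_logp  # first frame flagged
    back = [None]  # back[t][k][s] = (prev_k, prev_state, frame_t_flagged)

    for t in range(1, len(obs)):
        new = [[NEG_INF] * n_states for _ in range(max_corrupt + 1)]
        ptr = [[None] * n_states for _ in range(max_corrupt + 1)]
        for k in range(max_corrupt + 1):
            for s in range(n_states):
                # Either trust frame t (k unchanged) or flag it (uses one of the k flags).
                options = ((False, k, log_emit[s][obs[t]]), (True, k - 1, corrupt_logp))
                for flagged, prev_k, frame_score in options:
                    if prev_k < 0:
                        continue
                    for sp in range(n_states):
                        cand = score[prev_k][sp] + log_trans[sp][s] + frame_score
                        if cand > new[k][s]:
                            new[k][s] = cand
                            ptr[k][s] = (prev_k, sp, flagged)
        score = new
        back.append(ptr)

    # Choose the best final (flag count, state) pair jointly.
    best, k, s = NEG_INF, 0, 0
    for kk in range(max_corrupt + 1):
        for ss in range(n_states):
            if score[kk][ss] > best:
                best, k, s = score[kk][ss], kk, ss

    # Backtrack to recover both the state path and the flagged frames.
    path, flagged_frames = [s], []
    for t in range(len(obs) - 1, 0, -1):
        k, s, was_flagged = back[t][k][s][0], back[t][k][s][1], back[t][k][s][2]
        if was_flagged:
            flagged_frames.append(t)
        path.append(s)
    if k == 1:  # any remaining flag must belong to frame 0
        flagged_frames.append(0)
    path.reverse()
    flagged_frames.reverse()
    return best, path, flagged_frames


# Toy two-state model: each state strongly prefers its own symbol.
log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_emit = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
obs = [0, 0, 1, 0, 0]  # the single 1 plays the role of an impulsive corruption
score, path, flagged = robust_viterbi(obs, log_init, log_trans, log_emit, max_corrupt=1)
print(path, flagged)
```

With `corrupt_logp=0.0` and emission probabilities below one, flagging a frame never lowers the score, so the search tends to use every flag it is allowed. This is why the allowed number of corruptions has to be chosen separately, which the abstract handles with a likelihood-ratio estimate that this sketch does not attempt to reproduce.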

Thursday, 7 August 2003 (EE1 403, 4:30-5:30PM)
The IBM Multimedia Mining Project
-- Harriet Nock
Audio-Visual Speech Technologies Group
IBM TJ Watson Research Center, Yorktown Heights, NY.

Abstract
The IBM Multimedia Mining Adventurous Research Project is a joint project between the Audio-Visual Speech Technologies Group and the Pervasive Media Management Group. Our goal is to develop an easily extensible framework for automatically annotating an arbitrarily large set of semantic concepts (objects, sites, events) in digital media, particularly digital video. The talk will begin by discussing this goal in more detail and will then give an overview of recent progress, including tools and statistical modelling techniques that are proving useful. We will then discuss IBM's participation in the annual NIST Video TREC benchmarks, which are large and still-expanding cross-company and cross-university benchmarks focusing on (a) automatic semantic annotation and (b) information retrieval from digital video. In particular, we will highlight some achievements from 2002 and discuss some of the challenges to come in 2003. The talk will also briefly mention other ongoing research in the Audio-Visual Speech Technologies Group, including recent progress in audio-visual speech recognition, speaker identification and speaker localisation.




Last updated ($Date: 2003/09/09 20:13:48 $)