University of Washington
Department of Electrical Engineering
SSLI-LAB: Speech and Language Processing Seminar
Summer Quarter, 2003
RM EE1-303/403
New EE Bldg
University of Washington, Seattle
Wed, 10 September 2003 (EE1 303, 3-4PM)
Hidden Feature Modeling for Speech Recognition Using Dynamic Bayesian Networks
-- Karen Livescu
MIT
Abstract
The majority of current approaches to automatic speech recognition (ASR) use
the phoneme or phone as the basic linguistic unit. Recently, however, there
have been growing doubts about this choice of unit, and a number of research
efforts have been aimed at either replacing the phone or supplementing it with
multiple streams of articulatory or other linguistic features. We refer to
these types of models as hidden feature models, since the features in question
are hidden from the listener (as opposed to acoustic features such as cepstral
coefficients, which are directly measured from the signal).
In our work, we use the framework of graphical models, and in particular
dynamic Bayesian networks (DBNs), to represent hidden feature models.
Graphical models are a natural choice because they allow for the explicit
representation of dependencies between multiple streams of variables, and
because there are standard algorithms for performing maximum-likelihood
parameter estimation and decoding for large classes of models.
This talk will present one class of DBN-based hidden feature models that we
have investigated. We will discuss the issues involved in designing the model
and training the parameters. We will present a factored model of the acoustic
observation probability that we have used to alleviate the inherent sparse
data problems, as well as initial experiments on a continuous digit
recognition task. Finally, we will describe ongoing and future extensions of
our work.
Monday, 30 June 2003 (EE1 303, 10-11AM)
Robust Viterbi Algorithm against Impulsive Noise
-- Manhung Siu
Hong Kong University of Science and Technology
Abstract
The Viterbi algorithm has been successfully applied in different pattern
recognition and communication tasks. However, if some parts of the
observation sequence are corrupted by impulsive noise and this
noise is not accounted for by the distortion measures, performance can
degrade significantly. In this talk, I will describe our proposed
modification to the Viterbi algorithm such that it can handle short,
impulsive noises. We call this the "Robust Viterbi Algorithm". The
underlying principle is to perform detection of corrupted observations
together with the Viterbi search, in effect making a joint decision on
the corrupted observations and the best path. To make the algorithm applicable
to various environments with different amounts of impulsive noise, we
also introduce an efficient approach for estimating the number of
corruptions based on a likelihood ratio. The effectiveness of this
algorithm is demonstrated in speech recognition problems. Experiments
show that more than 70% error reduction can be achieved relative to
using the standard Viterbi algorithm in a Gaussian replacement noise
environment. Beyond speech recognition, I will also briefly describe
how this can be applied to channel coding over an impulsive noise
channel.
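As a rough illustration of the joint-decision idea in the abstract above, the
following Python sketch extends a textbook Viterbi search so that it may flag
a bounded number of frames as corrupted while it searches for the best path.
It is not the speaker's algorithm: the function name robust_viterbi, the fixed
budget max_corrupt, and the constant substitute score log_skip are all
assumptions made for illustration, and the likelihood-ratio estimate of the
number of corruptions mentioned in the abstract is not reproduced.

```python
import math

def robust_viterbi(log_init, log_trans, log_emit, max_corrupt, log_skip=0.0):
    """Viterbi search that may flag up to max_corrupt frames as corrupted.

    log_init[s]     : log prior of state s
    log_trans[r][s] : log P(next state s | current state r)
    log_emit[t][s]  : log-likelihood of frame t under state s
    log_skip        : score substituted for a flagged frame (an assumption)
    Returns (best_score, state_path, flagged_frames).
    """
    T, S, K = len(log_emit), len(log_init), max_corrupt
    NEG = -math.inf
    # score[k][s]: best score ending in state s with k frames flagged so far
    score = [[NEG] * S for _ in range(K + 1)]
    first = [[None] * S for _ in range(K + 1)]
    for s in range(S):
        score[0][s] = log_init[s] + log_emit[0][s]
        first[0][s] = (None, 0, False)
        if K >= 1:
            score[1][s] = log_init[s] + log_skip
            first[1][s] = (None, 0, True)
    back = [first]  # back[t][k][s] = (prev_state, prev_k, frame_t_flagged)
    for t in range(1, T):
        new = [[NEG] * S for _ in range(K + 1)]
        ptr = [[None] * S for _ in range(K + 1)]
        for k in range(K + 1):
            for s in range(S):
                # Frame t is either kept (same k) or flagged (uses one of k).
                for prev_k, flagged in ((k, False), (k - 1, True)):
                    if prev_k < 0:
                        continue
                    obs = log_skip if flagged else log_emit[t][s]
                    for r in range(S):
                        cand = score[prev_k][r] + log_trans[r][s] + obs
                        if cand > new[k][s]:
                            new[k][s] = cand
                            ptr[k][s] = (r, prev_k, flagged)
        score = new
        back.append(ptr)
    # Joint decision: best final state and best number of flagged frames.
    best, k, s = max((score[k][s], k, s)
                     for k in range(K + 1) for s in range(S))
    path, flagged_frames = [], []
    for t in range(T - 1, -1, -1):
        path.append(s)
        r, prev_k, flagged = back[t][k][s]
        if flagged:
            flagged_frames.append(t)
        s, k = r, prev_k
    path.reverse()
    flagged_frames.reverse()
    return best, path, flagged_frames

# Toy two-state example (made-up numbers): frame 2 is an outlier.
log = math.log
demo_init = [log(0.5), log(0.5)]
demo_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
clean = [log(0.9), log(0.1)]
outlier = [log(1e-6), log(0.9)]
demo_emit = [clean, clean, outlier, clean, clean]

plain = robust_viterbi(demo_init, demo_trans, demo_emit, max_corrupt=0)
robust = robust_viterbi(demo_init, demo_trans, demo_emit, max_corrupt=1)
print(plain[1], plain[2])
print(robust[1], robust[2])
```

With max_corrupt=0 the function reduces to the standard Viterbi search. With
log_skip=0.0 and negative log-likelihoods, flagging a frame never lowers the
score, so the whole budget is normally spent; choosing the budget is the
part the abstract addresses with a likelihood ratio.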
Thursday, 7 August 2003 (EE1 403, 4:30-5:30PM)
The IBM Multimedia Mining Project
-- Harriet Nock
Audio-Visual Speech Technologies Group
IBM TJ Watson Research Center, Yorktown Heights, NY.
Abstract
The IBM Multimedia Mining Adventurous Research Project is a joint project
between the Audio-Visual Speech Technologies Group and the Pervasive Media
Management Group. Our goal is to develop an easily extendable framework for
automatically annotating an arbitrarily large set of semantic concepts
(objects, sites, events) in digital media, particularly digital video. The
talk will begin by discussing this goal in more detail and will then give an
overview of recent progress, including tools and statistical modelling
techniques that are proving useful. We will then discuss IBM's participation
in the annual NIST Video TREC benchmarks, which are large and still-expanding
cross-company and cross-university benchmarks focusing on (a) automatic
semantic annotation and (b) information retrieval from digital video. In
particular, we will highlight some achievements from 2002 and discuss some of
the challenges to come in 2003. The talk will also briefly mention other
ongoing research in the Audio-Visual Speech Technologies Group, including
recent progress in audio-visual speech recognition, speaker identification,
and speaker localisation.
Last updated ($Date: 2003/09/09 20:13:48 $)