University of Washington
Department of Electrical Engineering

SSLI-LAB : Signal, Speech, and Language Interpretation Seminar

Spring Quarter, 2008
Room EEB-403, EEB Building (unless otherwise specified)
University of Washington, Seattle

Tuesday, 4th April 2008 (EEB-403, 11:00am-12:00pm)
Three Speech Technology Talks by Visiting Scientists from NTT, Japan
-- Takuya Yoshioka, Shoko Araki, Shinji Watanabe
NTT Communication Science Laboratories, Japan

Abstracts
Talk One: Maximum Likelihood Approach to Enhancement of Noisy Reverberant Speech Signals
Speaker: Takuya YOSHIOKA, NTT Communication Science Laboratories, Japan
Noise suppression has been studied for decades, and statistical approaches have been particularly successful. Speech dereverberation, by contrast, has attracted much attention only in relatively recent years. Over the past two years, we have proposed a series of speech dereverberation methods based on maximum likelihood estimation (MLE) and shown their advantages. In this talk, we present our latest speech enhancement method, which integrates MLE-based noise suppression and MLE-based speech dereverberation. The method is derived using the expectation-conditional maximization (ECM) algorithm: the E-step corresponds to noise suppression, and the CM-steps correspond to dereverberation. Hence, the noise suppression and dereverberation processes of the proposed method depend on each other rather than being applied independently. We will present experimental results showing the advantages of the proposed method over conventional noise suppression and dereverberation methods applied in sequence.

Talk Two: Speaker diarization and speech enhancement in meetings and conversations
Speaker: Shoko ARAKI, NTT Communication Science Laboratories, Japan
We will introduce a speaker diarization system that uses a small number of microphones to estimate who spoke when. The system consists of a noise-robust voice activity detector (VAD), a GCC-PHAT-based direction-of-arrival (DOA) estimator, and a DOA classifier. Using the estimated diarization result, we can also enhance each speaker's utterances with a maximum signal-to-noise ratio (MaxSNR) beamformer. We will report evaluation results in terms of a standard performance measure, the diarization error rate (DER). Even for real conversations recorded in a real room (reverberation time RT = 350 ms), the speaker error time of the proposed system was very small. The presentation is based on work reported at ICASSP 2008.
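The GCC-PHAT DOA front end named above can be illustrated with a minimal NumPy sketch for a single microphone pair. The cross-power spectrum is normalized to unit magnitude (the PHAT weighting) so that only phase remains, and the lag of the resulting correlation peak gives the time difference of arrival (TDOA), which a far-field two-microphone model converts to an angle. The function names `gcc_phat_tdoa` and `tdoa_to_doa`, the 343 m/s speed of sound, and the far-field geometry are assumptions for illustration. This is not the NTT implementation, and it omits the VAD, the DOA classifier, and the MaxSNR beamformer.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value (hypothetical constant)


def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Return the delay of x1 relative to x2 in seconds (positive: x1 lags x2)."""
    n = len(x1) + len(x2)              # zero-pad so circular correlation acts linear
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:            # optionally restrict to physically possible lags
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so index 0 is lag -max_shift and the last index is lag +max_shift.
    cc = np.concatenate((cc[n - max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def tdoa_to_doa(tau, mic_distance):
    """Far-field angle (degrees from broadside) for a two-microphone pair."""
    ratio = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.standard_normal(1024)
    delayed = np.concatenate((np.zeros(5), s[:-5]))  # mic 1 hears the source 5 samples later
    tau = gcc_phat_tdoa(delayed, s, fs=16000)
    print(tau, tdoa_to_doa(tau, mic_distance=0.2))   # 0.2 m spacing is hypothetical
```

In a diarization system of the kind described, per-frame DOA estimates like these (computed only on frames the VAD marks as speech) would then be grouped by a classifier so that each cluster of directions corresponds to one speaker.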

Talk Three: Incremental Adaptation Based On A Macroscopic Time Evolution System
Speaker: Shinji WATANABE, NTT Communication Science Laboratories, Japan
Incremental adaptation techniques for speech recognition aim to adjust acoustic models quickly and stably to time-variant acoustic characteristics caused by such factors as changes of speaker, speaking style, and noise source over time. In this talk, we propose a novel incremental adaptation framework based on a macroscopic time evolution system, which models the time-variant characteristics by successively updating the posterior distributions of acoustic model parameters. The incremental update consists of a prediction step and a correction step in accordance with Kalman filter theory, which makes the adaptation both quick and stable. We also provide a unified interpretation of the proposed framework and the two major conventional approaches: indirect adaptation via transformation parameters (e.g., maximum likelihood linear regression, MLLR) and direct adaptation of classifier parameters (e.g., maximum a posteriori, MAP, estimation). We show analytically and experimentally that the proposed incremental adaptation subsumes both conventional approaches and their combination, and simultaneously possesses their quick and stable adaptation characteristics.
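The prediction/correction cycle described in this abstract can be illustrated with a deliberately simplified, scalar sketch. It assumes a single Gaussian mean parameter with a Gaussian posterior, a random-walk time evolution with a hypothetical process variance `process_var`, and observation frames with a known variance `obs_var`; the names `GaussianPosterior`, `predict`, `correct`, and `incremental_adapt` are hypothetical. It is not the authors' formulation, which operates on full HMM acoustic models and also covers transformation-based (MLLR-style) adaptation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GaussianPosterior:
    """Gaussian belief N(mean, var) over one scalar model parameter (toy assumption)."""
    mean: float
    var: float


def predict(post: GaussianPosterior, process_var: float) -> GaussianPosterior:
    # Prediction step: under an assumed random-walk evolution the parameter
    # may drift between blocks, so the mean is kept and the variance grows.
    return GaussianPosterior(post.mean, post.var + process_var)


def correct(prior: GaussianPosterior, frames: Sequence[float],
            obs_var: float) -> GaussianPosterior:
    # Correction step: conjugate Gaussian update using the block's frames,
    # each assumed to be drawn from N(parameter, obs_var).
    n = len(frames)
    if n == 0:
        return prior
    precision = 1.0 / prior.var + n / obs_var
    mean = (prior.mean / prior.var + sum(frames) / obs_var) / precision
    return GaussianPosterior(mean, 1.0 / precision)


def incremental_adapt(initial: GaussianPosterior,
                      blocks: Sequence[Sequence[float]],
                      process_var: float,
                      obs_var: float) -> List[GaussianPosterior]:
    """Run one predict/correct cycle per incoming block of adaptation data."""
    post = initial
    history = []
    for frames in blocks:
        post = correct(predict(post, process_var), frames, obs_var)
        history.append(post)
    return history


if __name__ == "__main__":
    prior = GaussianPosterior(mean=0.0, var=1.0)
    blocks = [[2.0, 2.0], [2.1, 1.9, 2.0]]
    for step, post in enumerate(incremental_adapt(prior, blocks, 0.1, 1.0)):
        print(step, round(post.mean, 3), round(post.var, 3))
```

With `process_var = 0` the prediction step changes nothing and the loop reduces to a recursive conjugate (MAP-style) update that weighs all past data equally, which is stable but slow to follow a change. A larger `process_var` inflates the prior variance before each block, so recent frames dominate and the estimate tracks changes faster. In this toy setting, that single parameter exposes the quickness/stability trade-off the abstract refers to.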


Past Quarters' Seminars


Last updated 2008/04/07 22:53:50 UTC