To achieve this goal, a baseline LVCSR system is designed as a starting point for adaptation. It is based on an acoustic model that gives a parsimonious representation of time variation: each speech segment is characterized as a Gaussian process with a polynomial mean trajectory. A maximum-likelihood algorithm for clustering polynomial trajectories is developed and used here for parameter tying, and later for adaptation. Recognition performance with this system is demonstrated to be comparable to that of other state-of-the-art models for LVCSR.
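As an illustration of the polynomial mean trajectory idea, the following minimal sketch fits one segment by least squares, which coincides with the maximum-likelihood estimate when the residuals are Gaussian with a covariance that is constant over the segment. The function name `fit_polynomial_trajectory`, the normalization of time to [0, 1], and the default quadratic order are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def fit_polynomial_trajectory(segment, order=2):
    """Fit a polynomial mean trajectory to one speech segment.

    segment: (N, D) array of N feature frames of dimension D.
    Returns (B, sigma): B is the (order+1, D) coefficient matrix and
    sigma is the (D, D) ML estimate of the residual covariance.
    """
    segment = np.asarray(segment, dtype=float)
    n_frames = segment.shape[0]
    # Assumption: time is normalized to [0, 1] so that segments of
    # different durations share one parameterization.
    t = np.linspace(0.0, 1.0, n_frames)
    # Design matrix with columns 1, t, t^2, ..., t^order.
    Z = np.vander(t, order + 1, increasing=True)
    # Least-squares solution of Z @ B ~= segment.
    B, *_ = np.linalg.lstsq(Z, segment, rcond=None)
    residual = segment - Z @ B
    # ML (biased) covariance of the residuals around the trajectory.
    sigma = residual.T @ residual / n_frames
    return B, sigma
```

A segment of N frames is thus summarized by `order + 1` coefficient vectors and one covariance, regardless of N, which is the sense in which the representation is parsimonious.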
Parametric trajectory models, unlike non-parametric models, allow the parameters of a trajectory to be adapted jointly using all observations in a segment. Maximum-likelihood and Bayesian adaptation algorithms for such models are developed assuming independence between the parameters of different sound classes, where the classes are determined by clustering.
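To make the notion of Bayesian adaptation concrete, the sketch below shows the textbook MAP update of a single Gaussian mean under a conjugate prior, applied independently to one sound class. It is a simplified stand-in and not the thesis algorithm, which adapts trajectory parameters jointly. The name `map_adapt_mean` and the prior weight `tau` are assumptions for this example.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean from adaptation data.

    prior_mean: (D,) speaker-independent mean for one sound class.
    frames:     (n, D) adaptation frames assigned to that class.
    tau:        hypothetical prior weight, expressed as an equivalent
                number of pseudo-observations at the prior mean.
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    frames = np.asarray(frames, dtype=float).reshape(-1, prior_mean.shape[0])
    n = frames.shape[0]
    if n == 0:
        # No adaptation data for this class: keep the prior mean.
        return prior_mean.copy()
    # Interpolate between the prior mean and the sample mean,
    # weighted by tau and by the amount of data n.
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + n)
```

With independent classes, a class that receives no adaptation data is left unchanged by this update, which is the limitation that motivates modeling dependencies between classes.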
Finally, the dependencies between different sound classes in the speech of a particular speaker are modeled as a Gaussian multiscale process defined by the evolution of a stochastic linear dynamical system on a tree. To adapt all sound classes with limited adaptation data, adaptation is viewed as optimal smoothing of such a process. Smoothing algorithms for such processes have been developed in the past, but parameter estimation of the process from data was largely an unsolved problem. A maximum-likelihood solution for parameter estimation based on the expectation-maximization algorithm is provided for dynamical systems defined on trees.
Results are presented on the Wall Street Journal and Switchboard corpora, and recognition performance gains are achieved in both supervised and unsupervised adaptation scenarios.