Joint Lexicon and Acoustic Model Design for Spontaneous Speech Recognition

Automatic speech recognition systems typically include an acoustic model representing the acoustic patterns of sub-word units, a lexicon specifying each word's pronunciation in terms of these units, and a language model that characterizes the likelihood of different word sequences. Although most parameters in a speech recognition system are estimated from data using an objective function, the unit inventory and lexicon are generally hand-crafted and therefore unlikely to be optimal. This project develops a joint solution to the related problems of learning a unit inventory and the corresponding lexicon from data. The initial stage of the work focused on unit design for the case where there is a single pronunciation per word, resulting in a system that significantly outperforms current phone-based approaches on the Resource Management corpus, a 1000-word vocabulary task. However, this approach requires every word in the lexicon to be observed in training, which is not practical for a large-vocabulary task. We therefore extended the algorithm for use in a hybrid system that uses automatically derived units when these are more likely than their phone-based counterparts, resulting in a small improvement in recognition for conversational speech (the Switchboard task). Current work focuses on extensions to represent cross-word context and to learn multiple pronunciations per word. The objective is to improve large-vocabulary recognition performance on spontaneous conversational speech, which has proved to be among the most difficult of all speech recognition problems.
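
The selection rule behind the hybrid system can be illustrated with a minimal sketch. It is not the project's implementation: the per-word granularity of the comparison, the function name build_hybrid_lexicon, the unit labels such as "u17", and the log-likelihood values are all assumptions made for illustration. The sketch only shows the idea of falling back to the phone-based pronunciation whenever a word has no derived-unit pronunciation or the derived one is not more likely.

```python
from typing import Dict, List

Pronunciation = List[str]


def build_hybrid_lexicon(
    phone_lexicon: Dict[str, Pronunciation],
    derived_lexicon: Dict[str, Pronunciation],
    phone_loglik: Dict[str, float],
    derived_loglik: Dict[str, float],
) -> Dict[str, Pronunciation]:
    """Pick, per word, the more likely of two candidate pronunciations.

    Assumption: likelihoods are compared word by word. Words with no
    derived-unit pronunciation (e.g. unseen in training) keep the
    phone-based entry.
    """
    hybrid: Dict[str, Pronunciation] = {}
    for word, phone_pron in phone_lexicon.items():
        derived_pron = derived_lexicon.get(word)
        has_scores = word in phone_loglik and word in derived_loglik
        if (
            derived_pron is not None
            and has_scores
            and derived_loglik[word] > phone_loglik[word]
        ):
            hybrid[word] = derived_pron  # derived units are more likely
        else:
            hybrid[word] = phone_pron  # fall back to phone-based units
    return hybrid


# Hypothetical toy data: unit labels and scores are invented.
phone_lexicon = {
    "yeah": ["y", "ae"],
    "okay": ["ow", "k", "ey"],
    "zebra": ["z", "iy", "b", "r", "ax"],  # not observed in training
}
derived_lexicon = {"yeah": ["u17", "u4"], "okay": ["u9", "u2", "u30"]}
phone_loglik = {"yeah": -120.5, "okay": -98.0}
derived_loglik = {"yeah": -110.2, "okay": -101.3}

hybrid = build_hybrid_lexicon(
    phone_lexicon, derived_lexicon, phone_loglik, derived_loglik
)
```

In this toy example, "yeah" takes its derived-unit pronunciation, "okay" keeps the phone-based one because it scores higher, and "zebra" keeps the phone-based one because it has no derived-unit entry.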

(August 1996 -- May 1999)

SPONSOR: ATR Interpreting Telecommunications Laboratories
