Feature-Based Automatic Language Identification

Current automatic language identification (LID) systems are mostly phone-based, i.e. language-discriminating information is assumed to be encoded in the statistical regularities governing phone sequences in different languages. This information is usually exploited by performing phone recognition on the test speech signal, using either language-dependent or cross-linguistic acoustic phone models. The resulting phone strings are then rescored using language-dependent phone n-gram models. The language associated with the n-gram model producing the best score is taken to be the language of the test utterance.
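To make the rescoring step concrete, the sketch below scores a recognized phone string against per-language bigram models and picks the best-scoring language. It is a minimal illustration only: the data structures, phone symbols, bigram statistics, and backoff floor are assumptions, not the models of any actual LID system.

    # Hypothetical sketch of phone n-gram rescoring for LID.
    # Each language model maps (previous_phone, phone) -> bigram log-probability.
    def ngram_score(phones, bigram_logprobs, floor=-10.0):
        """Sum bigram log-probabilities over a recognized phone sequence."""
        score = 0.0
        for prev, cur in zip(phones[:-1], phones[1:]):
            # Back off to a fixed floor score for bigrams unseen in training.
            score += bigram_logprobs.get((prev, cur), floor)
        return score

    def identify_language(phones, models):
        """Return the language whose n-gram model scores the phone string best."""
        return max(models, key=lambda lang: ngram_score(phones, models[lang]))

    # Toy usage with made-up bigram statistics for two languages.
    models = {
        "english": {("sil", "dh"): -1.0, ("dh", "ax"): -0.5, ("ax", "sil"): -1.2},
        "german":  {("sil", "d"): -0.9, ("d", "ax"): -0.7, ("ax", "sil"): -1.1},
    }
    print(identify_language(["sil", "dh", "ax", "sil"], models))  # -> "english"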

While successful to some degree, the phone-based approach suffers from a number of drawbacks. First, sub-phonemic characteristics which might have language-discriminating potential (such as different degrees of vowel nasalization or aspiration of plosives in different languages) can only be modelled at the expense of increasing the number of distinct phone models. Second, the inability of phone-based systems to make use of information below the phone level entails the need for fairly large temporal contexts (around 10 seconds of test speech) for satisfactory performance. Third, when porting a given phone-based LID system to a new language or dialect, new phones and phone sequences may be encountered which were not included in the training data and thus present problems for phone-based n-gram models.

In this project we investigate an alternative approach, viz. LID based on phonetic features instead of phones. Phonetic features are elementary characteristics of speech sounds, such as voicing, nasality, lip rounding, etc. All phones of the world's languages can be described and uniquely identified by a fairly compact set of approximately 30 phonetic features. In a feature-based LID system, acoustic models and n-grams are trained for N different feature groups; for each of these groups, a separate feature recognition pass is performed on the test signal, yielding N language-dependent LID scores. These are then combined into an overall LID score for each language in the system. A feature-based LID system thus extends the phone-based approach in that not just a single sequence of symbols but a combination of N symbol sequences is used to identify a language.
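As an illustration of the score-combination step, the following sketch sums per-language log scores across the N feature groups and selects the language with the best combined score. The feature group names, score values, and uniform weighting are illustrative assumptions; an actual system could combine the per-group scores differently.

    # Hypothetical sketch of combining per-feature-group LID scores.
    # scores[group][lang] is assumed to be the log score from that group's recognition pass.
    def combine_scores(scores, weights=None):
        """Sum (optionally weighted) log scores across feature groups for each language."""
        weights = weights or {group: 1.0 for group in scores}
        languages = next(iter(scores.values()))
        return {lang: sum(weights[g] * scores[g][lang] for g in scores) for lang in languages}

    # Toy usage: three feature groups, two candidate languages.
    scores = {
        "voicing":  {"english": -210.4, "german": -215.9},
        "nasality": {"english": -198.7, "german": -196.2},
        "place":    {"english": -402.1, "german": -407.8},
    }
    combined = combine_scores(scores)
    print(max(combined, key=combined.get))  # language with the best overall score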

We expect significant advantages from this approach with respect to both language identification accuracy and adaptation to new languages. First, given a fixed amount of training data, acoustic feature models and n-grams can be trained more robustly than the corresponding phone models since fewer classes need to be distinguished and training data can be shared across phones. Second, since phones can be decomposed into sets of phonetic features, it is possible to model fine-grained acoustic-phonetic distinctions, such as the cases of plosive aspiration and vowel nasalization mentioned above, without proliferating acoustic models unnecessarily. Furthermore, the possibility of exploiting information below the phone level might also reduce the amount of test material needed before a reliable LID decision can be made. Finally, a feature-based LID system offers greater flexibility with respect to including new languages or dialects, because models for new acoustic contexts can effectively be synthesized from a small set of well-trained feature models.
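The decomposition of phones into features can be pictured as a small lookup table; the feature inventory and values below are assumptions for illustration only, not the project's actual feature set. A phone that never occurred in the training data, e.g. a nasalized vowel, can still be described as a new combination of feature values for which well-trained models already exist.

    # Illustrative phone-to-feature decomposition (hypothetical feature inventory).
    PHONE_FEATURES = {
        "p": {"voicing": "voiceless", "manner": "plosive", "nasality": "oral"},
        "b": {"voicing": "voiced",    "manner": "plosive", "nasality": "oral"},
        "m": {"voicing": "voiced",    "manner": "nasal",   "nasality": "nasal"},
        "a": {"voicing": "voiced",    "manner": "vowel",   "nasality": "oral"},
    }

    # A nasalized /a/, unseen as a phone, is just a new combination of
    # feature values, each of which is already covered by a feature model.
    nasalized_a = {"voicing": "voiced", "manner": "vowel", "nasality": "nasal"}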

SPONSOR: Department of Defense

AWARD PERIOD: July 2000 - March 2003

TEAM MEMBERS:

PUBLICATIONS:
