Although speech recognition research has made significant progress in recent years, the overall performance of speech recognizers still does not attain the level of human speech perception. In particular, performance frequently deteriorates in adverse acoustic conditions such as noise or room reverberation. To overcome these problems, speech researchers have looked at enriching the statistical modeling techniques commonly used in speech recognition with expert knowledge about speech production or perception. This talk will focus on the use of knowledge about speech production, i.e. the articulatory processes by which the acoustic speech signal is generated. The first part of the talk will review the potential benefits of articulatory representations in speech recognition. Part two will describe several experiments involving (pseudo)articulatory-based recognition components, which demonstrate that articulatory representations provide information complementary to that of standard acoustic speech representations, and that this information can be successfully integrated to reduce word error rate. The final part of the talk will address the problem of acoustic-articulatory inversion and describe preliminary work on data-driven identification of acoustic cues for articulatory distinctions by rule extraction from trained neural networks.
We present an algorithm to assign unequal amounts of forward error correction (FEC) to compressed images that are transmitted over lossy communication channels, such as wireless networks or the Internet. If data loss occurs, the most important data will at least be received by the decoder. Thus, if a network is congested when someone tries to download an image, instead of the connection stalling while waiting for retransmissions, a slightly degraded view of the image can be displayed. We next present preliminary ideas on applying unequal amounts of FEC for graceful degradation of speech recognition performance over wireless networks.
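The benefit of unequal protection can be illustrated with a toy calculation. The sketch below is not the algorithm presented in the talk; it only scores a given allocation under assumptions introduced here for illustration: packets are lost independently with probability `p_loss`, a layer protected by `f` parity symbols survives whenever at most `f` of the `n_packets` packets are lost, and a layer of a progressive stream is useful only if every earlier layer was also decoded. The names `loss_cdf`, `expected_gain`, and the per-layer `gains` are hypothetical.

```python
from math import comb
from typing import List


def loss_cdf(n_packets: int, p_loss: float) -> List[float]:
    """P(at most k of n_packets are lost), k = 0..n_packets.

    Assumes independent packet losses (binomial model), an illustrative choice.
    """
    cdf, total = [], 0.0
    for k in range(n_packets + 1):
        total += comb(n_packets, k) * p_loss**k * (1 - p_loss) ** (n_packets - k)
        cdf.append(total)
    return cdf


def expected_gain(parity: List[int], gains: List[float],
                  n_packets: int, p_loss: float) -> float:
    """Expected quality gain of a progressive stream for one FEC allocation.

    parity[i] is the number of lost packets layer i can tolerate;
    gains[i] is the (hypothetical) quality gain from decoding layer i.
    """
    cdf = loss_cdf(n_packets, p_loss)
    total, weakest = 0.0, n_packets
    for f, g in zip(parity, gains):
        # A layer only helps if all earlier layers decode too, so its
        # effective protection is the weakest protection seen so far.
        weakest = min(weakest, f)
        total += g * cdf[weakest]
    return total


if __name__ == "__main__":
    gains = [10.0, 3.0, 1.0]  # most important layer first
    equal = expected_gain([2, 2, 2], gains, n_packets=8, p_loss=0.2)
    unequal = expected_gain([4, 2, 0], gains, n_packets=8, p_loss=0.2)
    print(f"equal protection:   {equal:.3f}")
    print(f"unequal protection: {unequal:.3f}")
```

With the same total parity, shifting protection toward the most important layer raises the expected gain in this toy setting, which is the intuition behind graceful degradation.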
AudioMining(tm) is an emerging technology at Dragon. It provides a multimedia interface which appears to be content-addressable, and which offers substantial benefit to many customers. Dr. Cohen will discuss the technical issues in delivering AudioMining(tm) solutions, including the state of the art in commercially delivered speech recognition, the data management issues, and the customer interface design and delivery. Current solutions will be demonstrated.
This talk describes our initial attempt at spoken document retrieval using the audio tracks of local television news broadcasts in Cantonese, a major dialect of Chinese. We studied the use of syllable-based units for audio indexing, which include base syllables and tonal syllables as monosyllables, overlapping bi-syllables and tri-syllables. The syllable was compared with the word for audio indexing. We performed a known-item retrieval task, using a video archive of 1801 news stories. The stories' transcripts were mapped into syllables by referencing our pronunciation dictionary (CUPDICT) and lexicon (CULEX). The news domain is extremely diverse and many words or terms in the news corpus (54.5 hours) are absent from our lexicons, which affected our text-based retrieval results. Indexing with overlapping bi-syllables (with tone) gave the best average inverse rank (AIR) of 0.83. The incorporation of lexical knowledge effectively reduced the size of the index term set while sustaining retrieval performance. We also attempted retrieval using speech recognition outputs. Our recognizer was trained mostly on clean, read speech, and had little adaptation to broadcast-quality speech. Using base syllables as overlapping bigrams, the AIR degraded to 0.46 due to recognition errors. To bridge the gap between text-based queries and audio-based documents, we also applied a query expansion technique, referencing the syllable recognition confusion matrix for expansion. The technique was found to contribute towards retrieval performance improvement.
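The two building blocks named in this abstract, overlapping syllable n-grams as index terms and the average inverse rank used for scoring, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes AIR is the mean over queries of 1/rank of the known item, counting 0 when the item is not retrieved, and the function names and the romanized example syllables are hypothetical.

```python
from typing import List, Optional, Sequence


def overlapping_ngrams(syllables: Sequence[str], n: int) -> List[str]:
    """Build overlapping syllable n-grams (n=2 gives bi-syllables)."""
    return ["_".join(syllables[i:i + n])
            for i in range(len(syllables) - n + 1)]


def average_inverse_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over queries.

    ranks[i] is the 1-based rank of the known item for query i,
    or None if it was not retrieved (contributes 0). This definition
    is an assumption; the abstract does not spell it out.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Hypothetical tonal syllables for one transcript fragment.
    syllables = ["hoeng1", "gong2", "san1", "man4"]
    print(overlapping_ngrams(syllables, 2))
    print(average_inverse_rank([1, 2, None, 4]))
```

Overlapping units keep some cross-syllable context in the index without requiring a word segmentation, which matters when many terms are missing from the lexicon.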
In this talk I will outline my work over the past ten years on the Optimal Encoding-Decoding Theory of human speech perception, which has been formulated to be directly amenable to machine computation. The theory consists of three basic, integrated elements: 1) approximate motor-encoding: the symbolic phonological process interfaced with the dynamic phonetic process in speech production; 2) robust auditory reception: speech signal transformation prior to the cognitive process; and 3) optimal cognitive decoding: optimal (by statistical criteria) matching of the auditory transformed signal with the "internal" model derived from a set of motor encoders distinct for separate speech classes. In this theory, the "internal" model in the brain of the listener is hypothesized to have been "approximately" established during the childhood speech acquisition process (or during the process of learning foreign languages in adulthood). In addition to accounting for much of the existing human speech perception data, the computational nature of this theory enables it to be used as the basic underpinning of computer speech recognition and synthesis systems. In this talk, I will first focus on the symbolic phonological model (one component of the motor-encoder) constructed based on the concept of overlapping articulatory features (five auto-segmental dimensions of feature streams in the current implementation). This feature-based model serves to parsimoniously represent pronunciation variation and long-span contextual dependency in the human production of spontaneous speech. It incorporates into the phonological representation of arbitrary speech utterances a set of high-level linguistic constraints including morpheme and syllable boundaries, syllable constituent categories, word-level stress, etc. Our ongoing work on interfacing the feature-based model to the target-directed, dynamic phonetic model will also be discussed.
Variability in spoken language poses seemingly intractable problems for researchers in linguistics, speech and hearing, engineering, and other fields where normative values for language are sought. Variability has traditionally been treated as noise: something listeners must factor out in decoding the speech signal. However, recent perceptual research indicates that listeners use the information contained in predictable variability during speech perception. This talk will review several sources of variability that stem from talkers' attempts to aid the listener in recovering the signal. It will also present results from recent experimental research on predicting variability.
To date, launching a speech-driven application (such as the United Airlines Flight Information Line or Charles Schwab's Voice-enabled Investing) has required close collaboration between engine vendor and application owner/developer. Quite frequently there is little or no overlap in technical interest and expertise between these groups. This has led to an evolution of ASR engineware that includes far more than just an engine, resembling a portal system, complete with call flow control mechanisms, direct integration with computationally-enhanced telephony line cards, etc. Not surprisingly, the more conveniences are added, the more difficult access to core functionality tends to become. This talk describes one effort in particular to redefine this process and restore control to the user/application developer. From a discussion of software architecture integration philosophy, strategies for processing captured audio data for model retraining, developing custom vocabularies, and strategies for automating data processing, as well as efforts to improve engine performance under adverse conditions, a picture of current "boundary conditions" emerges. Some suggestions for improvement are offered. This talk is also about learning how to create ways for people to interact with a system that are useful, appealing, and non-intimidating (examples will be shown/played). This has been a major element in our development strategy. Looking to the future, the idea is to create a platform for speech-driven applications. I'll provide a quick glimpse of what's around the corner and how it fits in with what's been done so far.