Ongoing research on improving the performance of speech-to-text (STT) systems has the potential to provide high-quality machine transcription of human speech in the future. However, even if a perfect STT system were available, both the readability of spontaneous speech transcripts and their usability in natural language processing systems would likely remain low, since such transcripts lack sentence segmentation and since spontaneous speech contains disfluencies and conversational fillers.
In this work, we experiment with methods to automatically detect sentence boundaries, disfluencies, and conversational fillers in spontaneous speech transcripts. Our system has a two-stage architecture: sentence boundaries and locations of interruption in speech are predicted first, and disfluent regions and conversational fillers are identified second. Decision trees trained with prosodic and lexical features are combined with a language model to categorize word boundaries, and rules learned through transformation-based learning are used to identify disfluent regions and conversational fillers. The research examines different methods of model combination and different categorical representations of word boundaries as cues to disfluent regions.
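The two-stage architecture described above can be illustrated with a minimal sketch. This is not the thesis system; the scores, threshold values, filler list, and function names are all hypothetical, chosen only to show the pipeline shape: stage one combines a prosody-based score with a language-model score to label each word boundary, and stage two applies simple lexical rules to flag conversational fillers.

```python
# Hypothetical sketch of a two-stage detection pipeline.
# Stage 1: label each word boundary by interpolating two posterior-like scores.
# Stage 2: flag conversational fillers with simple lexical rules.
# All thresholds, weights, and the filler list are illustrative assumptions.

def classify_boundary(prosodic_score, lm_score, weight=0.5):
    """Linearly interpolate a prosodic-model score and a language-model
    score, then map the combined score to a boundary label:
    'SU' (sentence boundary), 'IP' (interruption point), or 'NONE'."""
    combined = weight * prosodic_score + (1 - weight) * lm_score
    if combined > 0.7:
        return "SU"
    if combined > 0.4:
        return "IP"
    return "NONE"

def tag_fillers(words, fillers=frozenset({"uh", "um", "uh-huh"})):
    """Mark each word True if it matches a known filler token."""
    return [(w, w.lower() in fillers) for w in words]
```

For example, `classify_boundary(0.9, 0.9)` yields `"SU"`, while `tag_fillers(["so", "uh", "yeah"])` marks only `"uh"` as a filler. A real system would replace the interpolation with trained decision-tree and language-model scores, and the lexical rules with rules learned via transformation-based learning.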
Experiments were conducted on the Switchboard corpus using a state-of-the-art STT system, with results compared against those obtained on hand transcriptions. Using both prosodic and lexical information sources in an integrated detection approach gave the best results, with relatively good performance on sentence boundary detection. The system also achieved a high degree of accuracy in conversational filler detection on hand transcripts, but word recognition errors in STT transcripts substantially degraded edit region and conversational filler detection performance.