Topic Learning in Text and Conversational Speech

Constantinos Boulis


Extracting topics from large collections of data is a crucial step in enhancing information access. There has been an abundance of work on supervised topic learning methods for text, yet a number of directions in topic learning have received less attention, such as constructing feature spaces, unsupervised learning, and dealing with different language genres. This dissertation addresses these issues and is concerned with topic learning in text and conversational speech.

In the first half of the dissertation, general approaches to topic learning are investigated. Algorithms to combine different partitions are suggested and evaluated on a number of text corpora, offering improvements over established baselines. In addition, a novel feature augmentation method is developed that adds to the bag-of-words representation a small number of word pairs that exhibit a distinct pattern from their constituent words. The approach is evaluated on different corpora, and the results show a consistent performance gain for a number of learning methods.
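As a rough illustration of what augmenting a bag-of-words representation with word pairs can look like, the following Python sketch adds one binary feature for each selected pair whose two words both occur in a document. It is a minimal sketch under stated assumptions, not the dissertation's method: the function name `augment_bag_of_words`, the `w1+w2` feature naming, and the binary co-occurrence encoding are all hypothetical, and the criterion for selecting pairs (which the abstract does not specify) is left outside the code, with the pairs simply passed in.

```python
from collections import Counter


def augment_bag_of_words(tokens, selected_pairs):
    """Return bag-of-words counts plus one feature per selected word pair.

    `selected_pairs` is assumed to have been chosen beforehand by some
    selection criterion; that criterion is not modeled here.
    """
    features = Counter(tokens)  # standard bag-of-words counts
    vocab = set(tokens)
    for w1, w2 in selected_pairs:
        # Hypothetical encoding: a binary feature that fires when both
        # words of the pair occur anywhere in the document.
        if w1 in vocab and w2 in vocab:
            features[f"{w1}+{w2}"] = 1
    return features


doc = "the interest rate on the bank loan rose".split()
pairs = [("interest", "rate"), ("bank", "river")]
features = augment_bag_of_words(doc, pairs)
```

Here `features` keeps the original word counts and gains an `interest+rate` entry, while the `bank`/`river` pair adds nothing because `river` does not occur in the document.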

In the second half of the dissertation, issues that are relevant for topic learning in conversational speech are investigated. In the area of prosody, the studies involve prominence, loosely defined as phrase-level emphasis given to one or more syllables of a word. Experiments using average word statistics from an automatic prominence detector revealed that lack of prominence is an excellent indicator of low-salience words. The role of disfluencies is investigated using hand-annotated self-corrections. The experiments reveal that removing disfluencies has little impact on topic classification when using the standard bag-of-words representation. Also, a quantitative analysis of lexical patterns between genders in conversations is conducted, revealing important differences associated with the gender of the conversational partner. However, integrating gender information into a topic detection system did not improve topic classification performance. Finally, the impact of the errors introduced by the automatic speech recognition (ASR) component is assessed. A method to cluster words according to a confusability measure derived from the ASR system is proposed and shown to offer performance gains compared to using 1-best transcripts and computational gains compared to using multiple ASR hypotheses.

The full thesis in pdf format.
