Switchboard-I (SWB) is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven system handled the calls: it gave the caller appropriate recorded prompts, selected and dialed another person (the callee) to take part in a conversation, introduced a topic for discussion, and recorded the speech from the two subjects into separate channels until the conversation was finished. SWB is very popular in the speech recognition community and is used almost ubiquitously for training large-vocabulary conversational speech recognition systems. It consists of about 320 hours of data, of which about 250 hours are speech and the rest is non-speech (silence, noise, etc.).
SWB has been transcribed at the word level (that is, each acoustic waveform file has an associated sequence of words, but typically without human-determined word-boundary marks). In addition, fine-grained phone-level annotations (at the frame level) are available; these were generated semi-automatically using a speech recognition system. Because the speech recognizer has a non-zero error rate, these transcriptions are considered less reliable.
The Switchboard Transcription Project (STP) was undertaken to accurately annotate SWB at the phonetic and syllable levels. One of its goals was that such data could then be used to improve the performance of conversational speech recognition systems. Because the task is time-consuming and costly, only 75 minutes of speech segments selected from various SWB conversations were annotated at the phone level, and about 150 minutes were annotated at the syllable level.
Clearly, having access to STP-style annotations of the entire SWB corpus could be a valuable tool for a speech researcher.
While it is not feasible to manually annotate the entire Switchboard corpus in STP style, this is an ideal job for graph-based transductive learning. We treat the available STP data as labeled training data and the rest of SWB as unlabeled, construct a graph at the frame level over the labeled STP data and all of the rest of SWB, and infer the labels on the unlabeled data using modern semi-supervised learning methods (both label propagation and our own method, which seems to produce better results). Thus we are using about 75 minutes of labeled data to infer the labels on about 320 hours of speech.
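To make the idea of graph-based label propagation concrete, here is a minimal sketch in Python on toy data. It is not the parallel algorithm used for SWB, nor the alternative method referred to above; it is only the generic scheme in which labeled nodes are clamped and every unlabeled node repeatedly takes the weighted average of its neighbors' label distributions. The function names (`knn_graph`, `propagate_labels`), the Gaussian edge weights with unit bandwidth, and the eight 2-D points standing in for acoustic frames are all illustrative assumptions.

```python
import math


def knn_graph(points, k):
    """Link every point to its k nearest neighbors with Gaussian edge weights."""
    graph = {}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        # exp(-d^2) assumes a unit kernel bandwidth, an arbitrary choice for this toy.
        graph[i] = [(j, math.exp(-d * d)) for d, j in nearest[:k]]
    return graph


def propagate_labels(graph, seeds, num_classes, iterations=50):
    """Clamped label propagation; `seeds` maps labeled node index -> class index."""
    n = len(graph)
    # Unlabeled nodes start from a uniform distribution over classes.
    beliefs = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for i, label in seeds.items():
        beliefs[i] = [1.0 if c == label else 0.0 for c in range(num_classes)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if i in seeds:
                # Labeled nodes keep their known label throughout.
                updated.append(beliefs[i])
                continue
            total = sum(w for _, w in graph[i])
            # Weighted average of the neighbors' current distributions.
            updated.append([
                sum(w * beliefs[j][c] for j, w in graph[i]) / total
                for c in range(num_classes)
            ])
        beliefs = updated
    # Each node is assigned its most probable class.
    return [max(range(num_classes), key=lambda c: b[c]) for b in beliefs]


# Two well-separated groups of hypothetical "frames"; one labeled node per group.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.1, 0.0), (13.3, 0.0)]
seeds = {0: 0, 4: 1}
predicted = propagate_labels(knn_graph(points, k=2), seeds, num_classes=2)
print(predicted)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy setting two labeled nodes are enough to label the remaining six, which mirrors, at a vastly smaller scale, the use of a small labeled set to label a large unlabeled one.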
The total number of nodes in this graph ends up being about 118 million, and the degree of each node is 10. While SWB is an ideal application for semi-supervised learning, a graph this large is also (as of this writing, May 2009) by far the largest graph we know of ever to be used for semi-supervised learning. It was therefore necessary to carefully parallelize our algorithm on a shared-memory (SMP) machine in order for the task to complete in a reasonable amount of time. While the underlying goal of this research was to develop semi-supervised learning algorithms and their parallelization, an added benefit for further speech research is the completed SWB-I phonetic labeling at the frame level. We have therefore made this labeling available for free use. We have also made available the large graph and transduction sets so that, if you wish to compare your SSL algorithm to ours, you may do so.
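As a rough sanity check on the scale quoted above, the following back-of-the-envelope sketch relates the corpus duration to the node count and estimates the size of the edge list. The 10 ms frame step (100 frames per second) and the 8 bytes per edge (a 4-byte neighbor index plus a 4-byte weight) are assumptions made here for illustration; neither figure is stated above.

```python
HOURS_OF_AUDIO = 320           # approximate corpus duration
FRAMES_PER_SECOND = 100        # assumption: one frame every 10 ms
NODES_REPORTED = 118_000_000   # approximate node count of the graph
DEGREE = 10                    # neighbors per node


def frames_in(hours, fps=FRAMES_PER_SECOND):
    """Number of frames in `hours` of audio at `fps` frames per second."""
    return hours * 3600 * fps


def edge_storage_gib(nodes, degree, bytes_per_edge=8):
    """Size of a flat edge list in GiB (assumed 4-byte index + 4-byte weight)."""
    return nodes * degree * bytes_per_edge / 2**30


estimated_nodes = frames_in(HOURS_OF_AUDIO)
total_edges = NODES_REPORTED * DEGREE

print(estimated_nodes)  # 115200000, the same order as the ~118 million nodes
print(total_edges)      # 1180000000
print(round(edge_storage_gib(NODES_REPORTED, DEGREE), 1))  # about 8.8
```

Under these assumptions the edge list alone occupies several gigabytes before any label distributions are stored, which gives a sense of why a parallel shared-memory implementation matters at this scale.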