The Boston University Radio News Corpus

The last decade of speech research has seen tremendous gains in computer speech processing technology, as well as in our fundamental understanding of human speech communication, due to corpus-based speech research. A large number of corpora are now available for speech research, but none for American English with extensive prosodic annotation. The BU Radio News Corpus is designed to fill this gap. The corpus consists of over seven hours of speech from seven radio announcers (4 male, 3 female), recorded from actual broadcasts. Subsets of the corpus are annotated with automatically generated phonetic alignments and part-of-speech tags, and with hand-labeled prosodic markers. We eventually hope to annotate the entire corpus with these markers, as well as with syntactic structure. A version of the corpus is expected to be available from the Linguistic Data Consortium in late 1995.

National Science Foundation, NSF IRI-8805680 (12/88-12/91)
National Science Foundation and ARPA, NSF IRI-8905249 (8/89-12/95)
Linguistic Data Consortium (5/93-12/94)


