The COSINE Corpus - COnversational Speech In Noisy Environments

UPDATE -- October 22, 2009: The corpus is now available for download.

Created by: Alex Stupakov, Evan Hanusa, Deepak Vijaywargi, Jeff Bilmes, and Dieter Fox.
University of Washington Electrical Engineering -- The Signal, Speech and Language Interpretation Lab

This work was supported in part by DARPA's ASSIST Program (contract number NBCH-C-05-0137) and an ONR MURI grant (No. N000140510388).


The corpus was originally described in our ICASSP 2009 paper, and a more complete description of the data collection effort as well as the resulting data is given in a journal paper (submitted for publication to Computer Speech and Language). A draft of the paper is available here. The characteristics of the corpus are summarized below.


What is it?

The COSINE corpus is a set of multi-party conversations recorded in real-world environments with background noise.
The conversations were recorded on 7-channel wearable recording systems.
The total length of the recordings is 150 hours. A one-hour conversation between 4 people, each recording 7 channels, counts as 4 hours of recordings.
Of this total, 42.5 hours have accompanying word-level transcriptions, and the remaining 107.5 hours can be used for semi-supervised training.
Of the transcribed audio, 9.5 hours are speech and 33 hours are non-speech.
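
The hour accounting above can be checked with a few lines of Python. All figures are taken from the description; the helper name `recorded_hours` is only illustrative and is not part of any corpus tooling.

```python
def recorded_hours(conversation_hours: float, participants: int) -> float:
    """Each participant's wearable recorder counts separately toward the total."""
    return conversation_hours * participants

# Example from the text: a one-hour conversation between 4 people.
example_hours = recorded_hours(1.0, 4)

# Corpus totals quoted above, in hours.
TOTAL = 150.0
TRANSCRIBED = 42.5
UNTRANSCRIBED = 107.5
TRANSCRIBED_SPEECH = 9.5
TRANSCRIBED_NON_SPEECH = 33.0

# The quoted figures are mutually consistent.
assert TRANSCRIBED + UNTRANSCRIBED == TOTAL
assert TRANSCRIBED_SPEECH + TRANSCRIBED_NON_SPEECH == TRANSCRIBED

# Share of the transcribed audio that actually contains speech.
speech_fraction = TRANSCRIBED_SPEECH / TRANSCRIBED
print(f"{example_hours:.0f} recorded hours; speech fraction {speech_fraction:.1%}")
```

Note that the 7 channels of one recorder do not multiply the count: hours are counted per speaker, not per channel.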

Each speaker wore a portable recording system with 7 microphones: a 4-channel array worn in front of the speaker's chest, as well as a throat microphone, a shoulder-mounted microphone, and a close-talking microphone worn in front of the mouth.

7-microphone portable recording system:
photo of recording device

There were 33 recording sessions total. Recording sessions lasted between 45 and 90 minutes, and had between 2 and 7 participants.
There are pairwise as well as group conversations among the recordings.

The conversations are unprompted: participants were instructed to talk about anything they liked, so they spoke about topics that were natural and interesting to them.
A list of possible conversation topics was provided to the participants in case they ran out of things to talk about, though it was very rarely used.
As a result, the conversations are natural and spontaneous.
The conversations take place in a variety of noisy environments, both indoors and outdoors.


Audio Samples

Synchronized segments (16 kHz, 16-bit) from each type of microphone worn by one speaker:

30 second segment - male speech, outdoors, with strong wind:
Close talking mic - Highest quality audio - (Sennheiser ME-3)
Mic array - Stereo file with 2 channels from the mic array
Throat microphone
Shoulder microphone - lots of wind noise in this example
Shoulder microphone (HPF) - same recording, highpass filtered to remove some wind noise

14 second segment - female speech, indoors, in an arcade:
Close talking mic - Highest quality audio - (Sennheiser ME-3)
Mic array - One channel from the mic array
Throat microphone
Shoulder microphone


What can it be used for?

This corpus has been designed to train noise-robust speech recognition systems.

Several aspects of the data can be exploited to improve recognition performance.
* The recorded audio contains natural, spontaneous, conversational speech.
* The recordings were made in environments with a wide range of noise types and noise levels, so the speech is subject to the Lombard effect.
* Traditional microphone array beam-steering techniques can be used on the 4-channel microphone array recordings.
* The audio from all the channels is synchronized, so multi-stream speech recognition techniques can be used, and mappings from noisy to clean speech can be learned.
* Synchronized speaker turn information is available and can be used to learn conversation dynamics.
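
As a rough illustration of the beam-steering idea mentioned in the list above, here is a minimal delay-and-sum sketch in plain Python. It assumes the per-channel steering delays are already known as whole numbers of samples; the function name, the delays, and the toy signals are all hypothetical, and nothing here reflects the actual array geometry or any processing applied to the corpus.

```python
from typing import List, Sequence

def delay_and_sum(channels: Sequence[Sequence[float]], delays: Sequence[int]) -> List[float]:
    """Average the channels after advancing channel i by delays[i] samples.

    delays[i] is the non-negative integer number of samples by which the
    target signal arrives later on channel i than on the earliest channel.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Keep only the samples for which every shifted channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        return []
    n = len(channels)
    return [sum(ch[t + d] for ch, d in zip(channels, delays)) / n for t in range(length)]

# Toy example (assumed data): one pulse reaches 4 channels 0, 1, 2 and 3 samples late.
pulse = [0.0, 1.0, 0.5, 0.0]
toy_delays = [0, 1, 2, 3]
toy_channels = [[0.0] * d + pulse + [0.0] * (3 - d) for d in toy_delays]
aligned = delay_and_sum(toy_channels, toy_delays)
```

Once the channels are aligned, the target speech adds coherently while noise that is uncorrelated across microphones tends to average down. A practical system would also need to estimate the delays and handle fractional-sample shifts, which this sketch leaves out.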


How can I get it?

The corpus recordings are available as tar archives of FLAC format files.
They are accompanied by transcriptions as well as participant surveys (which contain demographic data and answers to questions about the speakers' language experience).
Audio quality is 44.1 kHz, 16 bit. The original 48 kHz, 24 bit recordings are available upon request.
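
Since the recordings ship as tar archives of FLAC files, a first step after downloading might be to list the audio files in an archive. The sketch below uses only Python's standard `tarfile` module. The internal layout of the real archives is not described here, so the file names in the demo are hypothetical placeholders, and the demo archive is built in memory rather than taken from the corpus.

```python
import io
import tarfile
from typing import BinaryIO, List

def list_flac_members(archive: BinaryIO) -> List[str]:
    """Return the sorted names of regular files ending in .flac inside a tar archive."""
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.lower().endswith(".flac")
        )

def make_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory tar whose members hold placeholder bytes, not real FLAC."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in ("demo/session_a.flac", "demo/session_a.txt"):  # hypothetical names
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)  # rewind so the archive can be read back
    return buffer

flac_names = list_flac_members(make_demo_archive())
```

For a downloaded archive, the same function can be given a file opened with `open(path, "rb")`. Decoding the FLAC audio itself requires a separate decoder or third-party library, which is not shown.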

To gain access to the download links, please CLICK HERE.

If you have any questions, please get in touch with us.
email: cosine {at} ssli.ee.washington.edu

