Vocal Joystick Vowel Corpus
The Vocal Joystick vowel corpus was the result of a joint effort between the Department of Linguistics and the Department of Electrical Engineering at the University of Washington. The vowels were recorded at the Department of Linguistics Phonetics Lab at the University of Washington. This file briefly describes the purpose of the vowel corpus and the structure of the corpus.
The corpus has already been reported in a publication and has also been used in a number of Vocal Joystick-related publications. Please see the publications section of the main VJ home page.
Ultimately, the VJ is intended to be language-independent, user-friendly, and flexible in its application. That is, learning to master the input vocalizations should be as easy as learning to use a mouse, regardless of the vowels in the language that a user speaks. Moreover, the vocalizations should be drawn from a set that minimizes the possibility of repetitive use strain and maximizes ease of use. In the world's languages, continuous sounds can be drawn from three main classes: 1) vocalic (vowel-like) sounds, which result from the resonances of the vocal tract shape and can change continuously depending on jaw, lip, and tongue position, as long as there is no significant obstruction in the vocal tract; 2) pitch (rate of vocal fold vibration), which results from a complex interaction between subglottal (lung) pressure and vocal fold tension (itself the result of a variety of muscular adjustments), again as long as downstream adjustments do not impede airflow across the vocal folds; and 3) intensity, which generally results from changes in subglottal pressure (for voiced sounds).
Manipulations of vocalic quality, pitch, and intensity are found as quasi-independent but coexisting elements in every spoken language. That is, every language manipulates vowels independently of pitch and intensity, pitch independently of vowels and intensity, and intensity independently of vowels and pitch, for linguistic purposes. Vocalic, pitch, and intensity manipulations are also found outside of language in some form in all cultures; any vocalization with voicing and an open vocal tract (sighs, laughs, moans, etc.) involves vocalic sounds with pitch and intensity manipulations. This permits the independent manipulation of pitch and intensity while producing vocalic sounds as a continuous parameter.
Generally, vocalic sounds (vowels) can be described as occupying points in, or moving through, a two-dimensional space defined primarily by the first two resonances of the vocal tract. Those that remain constant throughout their duration are generally referred to as monophthongs and occupy a constant point in the two-dimensional space. Vocalic sounds that change over their duration are referred to as diphthongs or vowel-to-vowel transitions. Diphthongs typically have a single dominant vowel followed by a brief transition, whereas vowel-to-vowel transitions typically have equal durations of two vowels separated by a transition.
To train the VJ-engine, we began a large vowel data collection effort in a controlled environment. The goal was a corpus of vocalic sounds with pitch and intensity manipulations that was representative of the utterances a user of the VJ-system would produce.
The set of vocalic sounds chosen for the Vocal Joystick was based on the physiological capabilities of the human vocal tract: the question was how many equidistant vowel sounds and vowel-to-vowel transitions it is possible to make. The resulting continuous set, represented in a modified version of ARPABET, includes nine monophthongs:
/ii, ee, ae, ah, iu, ax, aa, uu, oo/
and 12 vowel-to-vowel transitions:
/ii-uu, uu-ii, ii-ah, ah-ii, ae-ah, ah-ae, ae-ii, ii-ae, ae-uu, uu-ae, uu-ah, ah-uu/
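As an illustrative sketch (not part of the corpus tools), the inventory above can be written down in Python and checked for internal consistency. All variable and function names here are hypothetical; only the vowel labels come from this file.

```python
# VJ-BET inventory as listed above.
# The names MONOPHTHONGS, TRANSITIONS, etc. are illustrative, not from the corpus tools.
MONOPHTHONGS = ["ii", "ee", "ae", "ah", "iu", "ax", "aa", "uu", "oo"]
TRANSITIONS = [
    "ii-uu", "uu-ii", "ii-ah", "ah-ii", "ae-ah", "ah-ae",
    "ae-ii", "ii-ae", "ae-uu", "uu-ae", "uu-ah", "ah-uu",
]


def split_transition(label):
    """Split a label such as 'ii-uu' into its (start, end) vowels."""
    start, end = label.split("-")
    return start, end


def transition_endpoints(transitions):
    """Return the set of vowels that occur as a start or end of any transition."""
    endpoints = set()
    for label in transitions:
        endpoints.update(split_transition(label))
    return endpoints


def check_inventory(monophthongs, transitions):
    """Return a list of problems; an empty list means the inventory is consistent."""
    problems = []
    for label in transitions:
        start, end = split_transition(label)
        for vowel in (start, end):
            if vowel not in monophthongs:
                problems.append(f"{label}: unknown vowel {vowel}")
        if f"{end}-{start}" not in transitions:
            problems.append(f"{label}: reverse direction missing")
    return problems


if __name__ == "__main__":
    print(len(MONOPHTHONGS), "monophthongs,", len(TRANSITIONS), "transitions")
    print("endpoints:", sorted(transition_endpoints(TRANSITIONS)))
    print("problems:", check_inventory(MONOPHTHONGS, TRANSITIONS))
```

Run against the lists above, the check reports no problems: every transition starts and ends on a listed monophthong, and each transition also appears in the reverse direction. The 12 transitions are the ordered pairs of the four vowels ii, uu, ae, and ah.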
In addition to the vowels, three further vocalic parameters were manipulated: duration, amplitude, and intonation. The following lists the manipulations of each that were elicited for the vowels and vowel-to-vowel transitions:
- Duration: short (1000 ms), long (2000 ms), and nudge (a very short production of the vowel)
- Amplitude: quiet, normal, loud, quiet to loud, loud to quiet
- Intonation: level, rising and falling
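The parameter values listed above can be enumerated programmatically, for example when planning how to index recordings. The sketch below is only an illustration: this file does not state whether every combination of duration, amplitude, and intonation was actually recorded, so the full cross product is an upper bound on the number of conditions, not a description of the corpus. The names are hypothetical.

```python
from itertools import product

# Parameter values as listed above; the labels are spelled for illustration.
DURATIONS = ["short", "long", "nudge"]
AMPLITUDES = ["quiet", "normal", "loud", "quiet-to-loud", "loud-to-quiet"]
INTONATIONS = ["level", "rising", "falling"]

# Nominal durations in milliseconds, from the list above.
# No figure is given for "nudge", so it is left as None.
NOMINAL_MS = {"short": 1000, "long": 2000, "nudge": None}


def all_conditions():
    """Return every (duration, amplitude, intonation) combination.

    Assumption: this is the full cross product. Whether all of these
    combinations were elicited is not stated in this file.
    """
    return list(product(DURATIONS, AMPLITUDES, INTONATIONS))


if __name__ == "__main__":
    conditions = all_conditions()
    print(len(conditions), "possible conditions per vowel or transition")
    print(conditions[0])
```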
The Vocal Joystick (VJ) vowel corpus was created in conjunction with the development of the Vocal Joystick, a continuous control mechanism that uses vocalic parameters to control objects on a computer screen (buttons, sliders, etc.).
Corpus and Corpus Contents
The following files and subdirectories are located in the corpus top-level directory:
- /readme.txt - this file
- VJCorpus - the vowel corpus and related documentation
- VJPapers - papers related to the Vocal Joystick project
About vowel transcription in .txt files
The orthographic symbols used for the vowels in the readme.txt files throughout this distribution are a modified version of ARPABET. Symbol modification was necessary because the vowel symbols in ARPABET reflect certain pronunciation tendencies in American English. For instance, several ARPABET symbols represent the tendency in American English toward diphthongization of the vowels [ey], [ow], and [uw]. In the VJ corpus, the realizations of these vowels did not possess diphthongization because the Vocal Joystick required pure monophthongs. Therefore, some of the ARPABET symbols were not usable (see the table below for the substitutions and additions we used). Additionally, the ARPABET symbol [ix] represents a reduced vowel in English. The pronunciation of this vowel elicited in data collection is less reduced and represents the full vowel typical of languages that have the vowel as a phoneme. This sound also does not have an ARPABET symbolic representation. We call our symbolic depictions "VJ-BET."
The vowels were elicited in this way to ensure full vowels, i.e., no vowel reduction, in both the monophthongs and diphthongs. Avoiding reduction was important for two reasons: 1) vowel reduction results in a change in vowel quality and therefore in a movement in the two-dimensional vowel space; 2) reduced vowels in English and other languages tend to be highly variable in their spectral characteristics and therefore make poor training data. Eliciting steady-state vowels resulted in data with less variability.
The following is a table of the orthographic symbols used in the VJ readme.txt files compared to the closest corresponding ARPABET symbol:
| VJ-BET | Closest ARPABET Symbol |
|---|---|
| [ii] | [iy] |
| [ee] | [ey] |
| [ae] | [ae] |
| [ah] | ---- |
| [iu] | [ix] |
| [ax] | [ax] |
| [aa] | [aa] |
| [uu] | [uw] |
| [oo] | [ow] |
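For scripts that need to convert labels, the table above can be expressed as a simple lookup. This is a minimal sketch with hypothetical names; the mapping itself is copied from the table, with None standing in for the entry that lists no ARPABET symbol.

```python
# VJ-BET -> closest ARPABET symbol, copied from the table above.
# None marks the row where the table gives no ARPABET symbol ("----").
VJBET_TO_ARPABET = {
    "ii": "iy",
    "ee": "ey",
    "ae": "ae",
    "ah": None,
    "iu": "ix",
    "ax": "ax",
    "aa": "aa",
    "uu": "uw",
    "oo": "ow",
}


def closest_arpabet(symbol):
    """Return the closest ARPABET symbol for a VJ-BET vowel.

    Returns None where the table lists no ARPABET symbol and raises
    KeyError for a label that is not a VJ-BET vowel.
    """
    return VJBET_TO_ARPABET[symbol]


def modified_symbols():
    """Return the VJ-BET symbols that differ from their closest ARPABET symbol."""
    return [vj for vj, arpa in VJBET_TO_ARPABET.items() if vj != arpa]


if __name__ == "__main__":
    print(closest_arpabet("uu"))
    print(modified_symbols())
```

Note that "closest" is not "equivalent": as described above, the VJ vowels were elicited as pure monophthongs and full vowels, so the ARPABET symbols are only approximate counterparts.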
Although IPA symbols would have been the optimal choice for representing the vowels throughout the documentation, IPA symbols cannot be represented in ASCII. We chose to use digraphs in the .txt files rather than single graphemes, both because there are not enough single graphemes to represent the vowels accurately and for consistency.
The IPA symbols corresponding to VJ-BET can be found in /docs/VJ-Bet.pdf
The recordingMethods.pdf and recordingMethods.doc files use Unicode IPA fonts and require the Unicode IPA extension to display properly.
For more information concerning the VJ Vowel Corpus, please contact the creators:
- Kelley Kilanski - University of Washington, Department of Linguistics
- Jon Malkin - University of Washington, Department of Electrical Engineering
- Xiao Li - University of Washington, Department of Electrical Engineering
- Richard Wright - University of Washington, Department of Linguistics
- Jeff Bilmes - University of Washington, Department of Electrical Engineering
This material is based on work supported by the National Science Foundation under grant IIS-0326382.