Web data for Language Modeling


Existing Resources

English CTS
Switchboard-related Conversational style (191M words, 350MB)
Fisher-related Conversational style (525M words, 1GB)
Fisher topics-related (175M words, 330MB)
English Meetings Speech
Data related to topics in meeting transcripts from CMU, ICSI and NIST (150M words, 275MB)
Mandarin CTS
Mandarin Conversational data (100M words, 220MB)
English LM data for Mandarin-English MT
Taipei Times 2003 archives (~20M words, 25MB)


Collect Your Own Data

Below are the steps involved in collecting LM training data from the web. All of the necessary scripts can be downloaded here. Several third-party tools are also referenced below. Note that everything after step 2 relates to text filtering and normalization and is optimized for conversational speech recognition. You should customize these text normalization tools or develop your own to fit your target application.

1. Generate Google queries (as you would enter them in the Google toolbar, one query per line).
2. Run search_google.pl

% set num_words=120000000  # the desired number of words
% search_google.pl $num_words < queries > raw_web.txt

Note 1. It takes about 24 hours to collect 120M words of text. Don't run more than 2 collection processes in parallel, as that may exceed limits on traffic that Google will tolerate.

Note 2. The script search_google.pl keeps track of the URLs it downloads from and ignores any duplicates. In the output, it also prepends each document with its URL, e.g. ###### http://<url>. To merge two or more text files that were collected independently, use combine_sources.pl, which filters out redundant web pages:
     % cat web1.txt web2.txt webN.txt | combine_sources.pl > web_all.txt
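
The merge described in Note 2 can be sketched as follows. This is a minimal Python illustration, not the actual combine_sources.pl: it assumes that every document starts with a "###### <url>" header line and that a page counts as redundant when its URL has already been seen. The function name combine_sources is made up for this example.

```python
DOC_MARKER = "###### "  # document header prefix described in Note 2


def combine_sources(lines):
    """Yield lines, skipping any document whose URL header was already seen.

    Assumption: redundancy is decided by exact URL match only.
    """
    seen = set()
    keep = True  # lines before the first header are passed through
    for line in lines:
        if line.startswith(DOC_MARKER):
            url = line[len(DOC_MARKER):].strip()
            keep = url not in seen
            seen.add(url)
        if keep:
            yield line


if __name__ == "__main__":
    import sys
    # Usage: cat web1.txt web2.txt | python combine_sketch.py > web_all.txt
    sys.stdout.writelines(combine_sources(sys.stdin))
```

Keeping only a set of URLs in memory means the merged text itself is streamed, which matters for collections of this size.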

3. Perform some basic filtering

% filter_123.csh < raw_web.txt > filt_web.txt

Three things happen here:
i) mxterminator is called to split text into sentences, one per line.
ii) filter_google_plain.pl performs some basic filtering, including
- keeping only lines that fall within ($min_words, $max_words) range,
- keeping only lines that contain at most $max_oov_rate OOVs (a dictionary based on the CMU lexicon can be used)
     Current settings:
$min_words = 3;
$max_words = 120;
$max_oov_rate = 0.25;
iii) filter_google_punct.pl is used for additional sentence splitting based on internal punctuation (";" ","), subject to a limit on how short the resulting sentences may be ($min_words_per_line = 4;). This produces shorter sentences that more closely resemble Switchboard.
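
The filtering rules in ii) and iii) can be sketched as follows. This is a hedged Python illustration, not the code of filter_google_plain.pl or filter_google_punct.pl. It assumes that the word-count range is inclusive, that punctuation attached to a word is ignored for the OOV lookup, and that pieces shorter than the minimum are merged with their neighbors. The names keep_line and split_on_punct are made up for this example.

```python
import re

MIN_WORDS = 3
MAX_WORDS = 120
MAX_OOV_RATE = 0.25
MIN_WORDS_PER_LINE = 4


def keep_line(line, vocab):
    """Return True if the line passes the length and OOV-rate checks."""
    words = line.lower().split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):  # assumed inclusive range
        return False
    oov = sum(1 for w in words if w.strip('.,;:!?"()') not in vocab)
    return oov / len(words) <= MAX_OOV_RATE


def split_on_punct(line, min_words=MIN_WORDS_PER_LINE):
    """Split on ';' and ',' but never emit a segment shorter than min_words."""
    pieces = [p.strip() for p in re.split(r"[;,]", line) if p.strip()]
    out, current = [], []
    for piece in pieces:
        current.append(piece)
        if sum(len(p.split()) for p in current) >= min_words:
            out.append(" ".join(current))
            current = []
    if current:  # short tail: attach it to the previous segment
        tail = " ".join(current)
        if out:
            out[-1] = out[-1] + " " + tail
        else:
            out.append(tail)
    return out
```

For example, split_on_punct("well, i think that is fine; we can go now, ok") merges the one-word pieces into their neighbors and returns two segments rather than four.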

4. Normalize the written text (i.e. convert into spoken form: expand abbreviations, numbers, etc.). Here we use NSW tools.

% /g/ssli/research/packages/nsw/bin/nsw_expand -domain pc110 -output norm_web.txt filt_web.txt

Depending on how much data you are processing, you may need to split the data into 5M-word chunks and run nsw_expand in parallel. It takes about 3-4 hours to normalize 5M words of raw text.
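
One way to prepare the chunks is sketched below in Python. It is an illustration only: it assumes that breaking at line boundaries is acceptable and that the chunks will be concatenated in their original order after normalization. The function name chunk_by_words is made up for this example.

```python
def chunk_by_words(lines, max_words=5_000_000):
    """Group lines into consecutive chunks of at most max_words words each.

    A single line longer than max_words still forms its own chunk.
    """
    chunk, count = [], 0
    for line in lines:
        n = len(line.split())
        if chunk and count + n > max_words:
            yield chunk
            chunk, count = [], 0
        chunk.append(line)
        count += n
    if chunk:
        yield chunk
```

Each chunk can then be written to its own file and passed to a separate nsw_expand process.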

5. Final text cleanup

% filter_google_final.pl < norm_web.txt > lm-ready_web.txt

Any remaining punctuation, numbers, formatting characters are removed. Some abbreviations missed by NSW are expanded. Periods are inserted for all abbreviations, e.g. "c n n" -> "c. n. n."
Note. Only documents with at least $min_lines (=4) lines are kept.
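
Two of the cleanup rules above can be sketched in Python as follows. This is not the code of filter_google_final.pl. It assumes that an abbreviation shows up as a run of two or more single-letter tokens, that documents are delimited by lines starting with ######, and that the header line does not count toward $min_lines. The names dot_abbreviations and drop_short_documents are made up for this example.

```python
MIN_LINES = 4
DOC_MARKER = "######"


def dot_abbreviations(line):
    """Add a period to each token in runs of two or more single letters."""
    tokens = line.split()
    out = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and len(tokens[j]) == 1 and tokens[j].isalpha():
            j += 1
        if j - i >= 2:  # a run such as "c n n"
            out.extend(t + "." for t in tokens[i:j])
            i = j
        else:  # ordinary word, or a lone letter such as "a" or "i"
            out.append(tokens[i])
            i += 1
    return " ".join(out)


def drop_short_documents(lines, min_lines=MIN_LINES):
    """Keep only documents with at least min_lines non-header lines."""
    docs, current = [], []
    for line in lines:
        if line.startswith(DOC_MARKER) and current:
            docs.append(current)
            current = []
        current.append(line)
    if current:
        docs.append(current)
    kept = []
    for doc in docs:
        body = sum(1 for l in doc if not l.startswith(DOC_MARKER))
        if body >= min_lines:
            kept.extend(doc)
    return kept
```

Requiring a run of at least two letters keeps ordinary one-letter words ("a", "i") from being treated as abbreviations; a real rule set would likely need further exceptions.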

6. (Optional) Perplexity filtering

You may use lm-ready_web.txt for building LMs as is. However, if you have collected a very large data set (300M+ words), you may want to consider perplexity filtering to reduce its size. Here you would use a domain-matched LM to compute the perplexity of each document and keep only the documents with the lowest perplexity. This approach may improve your LM if your web data collection resulted in a lot of style-mismatched text, which can happen when collecting topic-specific data. It may also allow you to reduce the data size by 20-50% (and hence memory requirements) with minimal degradation in performance. There are two steps in this process:

i) generate per-document perplexity output. The assumption is that each document starts with a line that begins with ######.  SRILM ignores all lines that start with #, hence the sed command below.

% sed 's/######/>######/g' lm-ready_web.txt |  /homes/bulyko/SRILM-dev/bin/i686/ngram -lm domain-matched-lm.gz -ppl2 - -escape ">######" | sed 's/>######/######/g' > lm-ready_web.ppl2

Note. The command line option "-ppl2" is used here; it is not part of the standard SRILM toolkit.

ii) Select the documents with the lowest perplexity so that the desired number of words $num_words or the percentage of the total number of words $percentage_total (whichever is smaller) is retained. E.g. select documents that make up 60% of the total number of words (but not exceeding 1B words) in lm-ready_web.txt

% filter_documents.pl lm-ready_web.ppl2 60 1000000000 < lm-ready_web.txt > ppl-filt_web.txt
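
The selection rule in step ii) can be sketched in Python as follows. This is an illustration, not the code of filter_documents.pl. Because the format of the .ppl2 file is not described here, the sketch works on in-memory (doc_id, perplexity, word_count) tuples, and it assumes that selection stops at the first document that would exceed the word budget. The function name select_documents is made up for this example.

```python
def select_documents(docs, percentage_total, num_words):
    """Return the ids of the lowest-perplexity documents within the budget.

    docs: iterable of (doc_id, perplexity, word_count) tuples.
    The budget is the smaller of num_words and percentage_total percent
    of the total word count.
    """
    docs = list(docs)
    total = sum(n for _, _, n in docs)
    budget = min(num_words, total * percentage_total / 100.0)
    selected, used = set(), 0
    for doc_id, _ppl, n in sorted(docs, key=lambda d: d[1]):
        if used + n > budget:  # assumed stopping rule
            break
        selected.add(doc_id)
        used += n
    return selected


docs = [("a", 50.0, 100), ("b", 200.0, 100), ("c", 80.0, 100),
        ("d", 500.0, 100), ("e", 120.0, 100)]
print(sorted(select_documents(docs, 60, 1_000_000_000)))  # ['a', 'c', 'e']
```

With 500 words in total, 60% gives a budget of 300 words, so the three lowest-perplexity documents are kept; the 1B-word cap does not apply because it is the larger of the two limits.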



Download and Install scripts

1. Download scripts.
2. setenv GOOGLE_HOME /directory/of/your/choice
3. cd $GOOGLE_HOME
4. tar -xzvf scripts.tgz
Note: Adjust paths (e.g. perl) in scripts according to your environment.