Web data for Language Modeling
Below are the steps involved in collecting LM training data from the
web. All of the necessary scripts can be downloaded here. There are
also several third-party tools referenced below. Note that everything
after step 2 relates to text filtering and normalization and is
optimized for conversational speech recognition. You should customize
these text normalization tools or develop your own tools to fit your
target application.
1. Generate Google queries (as you would enter them in the Google
toolbar, one query per line).
- Choose the most frequent n-grams (e.g. 4-grams) from in-domain
training data, e.g. from Switchboard: "know what i'm talking",
"you know i've got", etc.
- Consider combining n-grams into complex queries, e.g. "it
is a"+"you know i've never"+"i like that". This may make the
documents returned by Google more relevant, but avoid making
these queries too narrow. Here are some examples of
"conversational-style" queries extracted from Switchboard.
- If topic-specific data is desired, identify keywords related to
the topic, then combine these keywords with conversational
n-grams (from Switchboard or Fisher) into compound queries, e.g. "smoking"+"i think it was".
- The number of queries should reflect how much data you aim to
collect. As a rule of thumb, expect about 100K-500K words per query,
though the yield depends on how simple your queries are.
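The query-building steps above can be sketched in Python as follows.
This is only an illustration, not the method used to produce the
example queries: the function names, the fixed grouping of three
n-grams per query, and the default of 100K words per query (the low
end of the rule of thumb) are all assumptions.

```python
import math
from collections import Counter


def top_ngrams(sentences, n=4, k=1000):
    """Return the k most frequent word n-grams found in the sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return [" ".join(gram) for gram, _ in counts.most_common(k)]


def compound_queries(ngrams, per_query=3, keyword=None):
    """Join groups of n-grams into quoted compound queries.

    If a topic keyword is given, it is placed in front of every query.
    Leftover n-grams that do not fill a whole group are ignored.
    """
    queries = []
    for i in range(0, len(ngrams) - per_query + 1, per_query):
        parts = ['"%s"' % g for g in ngrams[i:i + per_query]]
        if keyword is not None:
            parts.insert(0, '"%s"' % keyword)
        queries.append("+".join(parts))
    return queries


def queries_needed(target_words, words_per_query=100_000):
    """Estimate how many queries are needed (assumed yield per query)."""
    return math.ceil(target_words / words_per_query)
```

Writing the returned strings to a file, one per line, gives input in
the form that step 2 reads from "queries".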
2. Run search_google.pl
% set num_words=120000000 # the desired number of words
% search_google.pl $num_words < queries > raw_web.txt
Note 1. It takes about 24 hours to collect 120M words of text. Don't
run more than 2 collection processes in parallel, as that may exceed
limits on traffic that Google will tolerate.
Note 2. The search_google.pl script keeps track of the URLs it
downloads from and ignores any duplicates. In the output, it also
prepends each document with its URL address, e.g. ###### http://<url>.
If you want to merge two or more text files collected independently,
use combine_sources.pl (this will filter out redundant web pages):
% cat web1.txt web2.txt webN.txt | combine_sources.pl > web_all.txt
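The merging idea can be illustrated with a short Python sketch. It is
not the logic of combine_sources.pl, which is not shown here; it only
assumes that every document starts with a "###### <url>" header line
and that a document is redundant when its URL has been seen before.
The function name is hypothetical.

```python
def merge_sources(lines, marker="######"):
    """Yield the lines of each document whose URL has not been seen yet.

    Assumption: lines that appear before the first header line belong
    to no document and are dropped.
    """
    seen = set()
    keep = False
    for line in lines:
        if line.startswith(marker):
            url = line[len(marker):].strip()
            keep = url not in seen
            seen.add(url)
        if keep:
            yield line
```

Passing the concatenated lines of several collected files through
this generator keeps the first copy of each page and skips the rest.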
3. Perform some basic filtering
% filter_123.csh < raw_web.txt > filt_web.txt
Three things happen here:
i) mxterminator is called to split text into sentences, one per line.
ii) filter_google_plain.pl performs some basic filtering, including
- keeping only lines whose word count falls within the
($min_words, $max_words) range,
- keeping only lines whose OOV rate is at most $max_oov_rate (a
dictionary based on the CMU lexicon can be used)
Current settings:
$min_words = 3;
$max_words = 120;
$max_oov_rate = 0.25;
iii) filter_google_punct.pl is used for additional sentence splitting
based on internal punctuation (";" ",") while limiting how short
sentences may be ($min_words_per_line = 4;). This produces shorter
sentences that more closely resemble Switchboard.
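Filters ii) and iii) can be approximated by the Python sketch below.
It is not the code of filter_google_plain.pl or
filter_google_punct.pl. The function names are hypothetical, and
several details are assumptions: the word-count range is treated as
inclusive, surrounding punctuation is stripped before the dictionary
lookup, and a fragment that is too short is merged with its neighbor.

```python
import re


def keep_line(line, vocab, min_words=3, max_words=120, max_oov_rate=0.25):
    """Return True if the line passes the length and OOV-rate checks."""
    words = line.lower().split()
    if not words or not (min_words <= len(words) <= max_words):
        return False
    # Assumption: punctuation attached to a word is ignored for lookup.
    oov = sum(1 for w in words if w.strip(".,;:!?\"'()") not in vocab)
    return oov / len(words) <= max_oov_rate


def split_on_punct(line, min_words_per_line=4):
    """Split a line after ';' or ',' without producing short pieces."""
    pieces = re.split(r"(?<=[;,])\s+", line.strip())
    out, current = [], []
    for piece in pieces:
        current.extend(piece.split())
        if len(current) >= min_words_per_line:
            out.append(" ".join(current))
            current = []
    if current:
        # A short trailing fragment is attached to the previous piece.
        if out:
            out[-1] += " " + " ".join(current)
        else:
            out.append(" ".join(current))
    return out
```

Here vocab is any set of lowercase words, for example one loaded from
a pronunciation dictionary.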
4. Normalize the written text (i.e. convert into spoken form: expand
abbreviations, numbers, etc.). Here we use NSW tools.
% /g/ssli/research/packages/nsw/bin/nsw_expand -domain pc110 -output norm_web.txt filt_web.txt
Depending on how much data you are processing, you may need to split
the data into 5M-word chunks and run nsw_expand in parallel. It takes
about 3-4 hours to normalize 5M words of raw text.
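One way to prepare such chunks is sketched below in Python. The
function name is hypothetical, and cutting only at line boundaries is
an assumption; the sketch says nothing about how nsw_expand itself
should be invoked on each chunk.

```python
def split_into_chunks(lines, max_words=5_000_000):
    """Group consecutive lines into chunks of at most max_words words.

    A single line longer than max_words still forms its own chunk.
    """
    chunk, count = [], 0
    for line in lines:
        n = len(line.split())
        if chunk and count + n > max_words:
            yield chunk
            chunk, count = [], 0
        chunk.append(line)
        count += n
    if chunk:
        yield chunk
```

Because the chunks preserve line order, concatenating the normalized
chunks in the same order restores the original sequence of lines.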
5. Final text cleanup
% filter_google_final.pl < norm_web.txt > lm-ready_web.txt
Any remaining punctuation, numbers, and formatting characters are
removed. Some abbreviations missed by NSW are expanded. Periods are
inserted for all abbreviations, e.g. "c n n" -> "c. n. n."
Note. Only documents with at least $min_lines (=4) lines are kept.
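Two of these cleanup operations are sketched below in Python. This is
not the code of filter_google_final.pl. The function names are
hypothetical, and the sketch rests on assumptions: an abbreviation is
taken to be a run of two or more single-letter words, each document
starts with a "######" header line, and the header does not count
toward the minimum number of lines.

```python
def dot_abbreviations(line):
    """Add periods to runs of two or more single-letter words."""
    words = line.split()
    out = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and len(words[j]) == 1 and words[j].isalpha():
            j += 1
        if j - i >= 2:
            out.extend(w + "." for w in words[i:j])
            i = j
        else:
            # A lone letter such as "i" or "a" is left unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


def drop_short_documents(lines, min_lines=4, marker="######"):
    """Yield only documents that have at least min_lines body lines."""
    def flush(doc):
        body = [l for l in doc if not l.startswith(marker)]
        return doc if len(body) >= min_lines else []

    doc = []
    for line in lines:
        if line.startswith(marker) and doc:
            yield from flush(doc)
            doc = []
        doc.append(line)
    yield from flush(doc)
```

A run-based rule like this would wrongly dot adjacent one-letter
words such as "i a", so a real filter may need a more careful test.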
6. (Optional) Perplexity filtering
You may use lm-ready_web.txt for building LMs as is. However, if you
have collected a very large data set (300M+ words), you may want to
consider perplexity filtering to reduce the data set size. Here you
use a domain-matched LM to compute the perplexity of each document
and keep only the documents with the lowest perplexity. This approach
may improve your LM if your web data collection resulted in a lot of
style-mismatched text, which can happen when collecting
topic-specific data. It may also allow you to reduce the data size by
20-50% (and hence memory requirements) with minimal degradation in
performance. There are two steps in this process:
i) Generate per-document perplexity output. The assumption is that
each document starts with a line that begins with ######. SRILM
ignores all lines that start with #, hence the sed command below.
% sed 's/######/>######/g' lm-ready_web.txt | /homes/bulyko/SRILM-dev/bin/i686/ngram -lm domain-matched-lm.gz -ppl2 - -escape ">######" | sed 's/>######/######/g' > lm-ready_web.ppl2
Note. The command-line option "-ppl2" used here is not part of the
standard SRILM toolkit.
ii) Select the documents with the lowest perplexity so that the
desired number of words $num_words or the percentage of the total
number of words $percentage_total (whichever is smaller) is retained.
E.g. select documents that make up 60% of the total number of words
(but not exceeding 1B words) in lm-ready_web.txt:
% filter_documents.pl lm-ready_web.ppl2 60 1000000000 < lm-ready_web.txt > ppl-filt_web.txt
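The selection rule can be sketched in Python as follows. This is not
the code of filter_documents.pl, and it does not parse the .ppl2
file, whose format is not described here. The function name and the
input, a list of (doc_id, n_words, ppl) tuples, are assumptions made
for illustration.

```python
def select_documents(docs, percentage_total, num_words):
    """Return ids of the lowest-perplexity documents within the budget.

    docs is a list of (doc_id, n_words, ppl) tuples. The word budget
    is the smaller of num_words and percentage_total percent of the
    total number of words.
    """
    total = sum(n for _, n, _ in docs)
    budget = min(num_words, total * percentage_total / 100.0)
    kept, used = [], 0
    for doc_id, n, _ppl in sorted(docs, key=lambda d: d[2]):
        if used + n > budget:
            break  # stop at the first document that would not fit
        kept.append(doc_id)
        used += n
    return kept
```

With five 100-word documents, a percentage of 60 and a limit of 1B
words, the budget is 300 words, so the three documents with the
lowest perplexity are kept.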
Download and Install scripts
1. Download scripts.
2. setenv GOOGLE_HOME /directory/of/your/choice
3. cd $GOOGLE_HOME
4. tar -xzvf scripts.tgz
Note: Adjust paths (e.g. perl) in scripts according to your environment.