Intro to the SRI Language Modeling toolkit

This page is intended to be a quick and simple introduction to the SRI language modeling toolkit. The full documentation (such as it is) consists of man pages available from the main SRILM web page.

Other documentation is available in the /srilm/doc directory where the toolkit is installed.

For EE517, SRILM is installed on the EE department Linux cluster. It is currently in /condor/EE517/SRILM.

NOTE: If you get strange errors about EOF reached before /end/, you may have run out of disk space. One student has already run into this problem: instead of the OS reporting that there was no more space, the toolkit ended up processing a half-written file and produced a confusing error that took us a little while to figure out. So, if you see similar errors, free up some space in your account and try again.
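A quick way to check how much space is left before rerunning the tools, sketched for a POSIX shell. The variable names are chosen for this example. If your account is limited by a quota rather than by the size of the disk, df may not reflect that limit.

```shell
# Available space, in kilobytes, on the filesystem holding the current directory.
# -P asks df for the portable one-line-per-filesystem format, so the
# "Available" figure is always the fourth field of the second line.
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
echo "available: ${avail_kb} KB"

# Total size of the current directory tree, to see what is using the space.
used_kb=$(du -sk . 2>/dev/null | awk '{ print $1 }')
echo "used by this directory: ${used_kb} KB"
```

Run it from the directory where the toolkit writes its output files.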

The binaries are in /condor/EE517/SRILM/bin/i686. These include ngram, ngram-count, and ngram-class, which are probably the first three programs from the toolkit that you will want to use.

If you want to run the tools without using their full path names, you can add two directories to your path (how you add these depends on which shell you are running):
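As a minimal sketch for Bourne-style shells (sh, bash), the following appends the binary directory named above to the search path. It uses only that one directory because it is the only one this page names. SRILM_BIN is a variable name chosen for this example.

```shell
# Directory taken from this page; adjust it if SRILM is installed elsewhere.
# SRILM_BIN is a name made up for this sketch, not something SRILM defines.
SRILM_BIN=/condor/EE517/SRILM/bin/i686

# Append the directory to the search path only if it is not already there.
case ":$PATH:" in
    *":$SRILM_BIN:"*) ;;
    *) PATH="$PATH:$SRILM_BIN" ;;
esac
export PATH
```

Put these lines in your shell startup file (for example ~/.profile) to make the change permanent. In csh or tcsh the equivalent is a setenv line such as setenv PATH ${PATH}:/condor/EE517/SRILM/bin/i686.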

SRILM/doc contains at least one important file (feel free to look at the others, too):
lm-intro:
This file gives a good introduction to building basic language models with the SRI toolkit. Note that the programs it refers to are located in /condor/EE517/SRILM/bin/i686.

The documentation does not provide examples for using ngram-class to automatically induce classes from data. (This is a new feature of the toolkit, and while there is a man page for ngram-class, they don't seem to have added it to the rest of the documentation.) Here's an example of how to build and use a simple class language model:

Induce classes:

ngram-class -vocab vocab_file \
            -text input_file \
            -numclasses num \
            -class-counts output.class-counts \
            -classes output.classes 
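In this example, vocab_file lists the vocabulary, input_file is the training text, and num is the number of classes to induce. The class N-gram counts are written to output.class-counts and the class definitions to output.classes.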

Estimate a bigram language model using the classes generated in the previous step:

ngram-count -order 2 \
            -read output.class-counts \
            -write output.ngrams
ngram-count -order 2 \
            -read output.ngrams \
            -lm output.bo
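In this example, the first command reads the class counts from the previous step and writes them out as N-gram counts in output.ngrams. The second command reads those counts and estimates a backoff bigram model, which it writes to output.bo.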

To calculate the perplexity of a test file (here test.txt):

ngram -lm output.bo -classes output.classes -ppl test.txt
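To make the reported figures easier to interpret, here is a small sketch of how the perplexity relates to the total log probability. It assumes the convention described in the SRILM documentation: ppl normalizes the base-10 log probability by the number of scored words plus one end-of-sentence token per sentence, and ppl1 leaves out the end-of-sentence tokens. The numbers below are hypothetical and do not come from any real run.

```shell
# Hypothetical figures standing in for the counts that ngram -ppl reports.
logprob=-1000   # total log10 probability of the test text
words=400       # word tokens in the test text
sentences=100   # each sentence contributes one end-of-sentence token
oovs=0          # out-of-vocabulary words, not scored
zeroprobs=0     # words given zero probability, not scored

# ppl = 10^(-logprob / (words - oovs - zeroprobs + sentences))
ppl=$(awk -v lp="$logprob" -v w="$words" -v s="$sentences" \
          -v o="$oovs" -v z="$zeroprobs" \
    'BEGIN { printf "%.2f", exp(-lp / (w - o - z + s) * log(10)) }')

# ppl1 uses the same formula without the end-of-sentence tokens.
ppl1=$(awk -v lp="$logprob" -v w="$words" \
           -v o="$oovs" -v z="$zeroprobs" \
    'BEGIN { printf "%.2f", exp(-lp / (w - o - z) * log(10)) }')

echo "ppl=$ppl ppl1=$ppl1"
```

Lower values mean the model predicts the test text better. ppl1 is larger here because the same total log probability is spread over fewer tokens.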


This page is maintained by Sarah Schwarm / sarahs@cs.washington.edu.