MULTI-LINGUAL LANGUAGE PROCESSING
List of Resources: Corpora, Tools, Publications

Signals, Speech, & Language Lab
at the University of Washington

Contents

Corpora
Tools
Mailing Lists
Papers
NLP Resource Lists
Other Stuff on the Web

Corpora

LDC - Linguistic Data Consortium
ELRA - European Language Resources Association
Speech Corpora at OGI
Tractor
Monolingual and multilingual corpora of European languages (e.g. Bulgarian, Croatian, Czech, Dutch, English, Estonian, French, Finnish, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, Slovak, Slovene, Swedish, Turkish, Ukrainian, Uzbek).
EuroWordNet: Dutch, Italian, Spanish, German, French, Czech and Estonian
GlobalWordNet
French-English Aligned Hansards of the Canadian Parliament
Free (provided by ISI)
MULTEXT
A series of projects for constructing multilingual corpora and tools. An alignment of the Official Journal of the European Community in German, Italian, Spanish, French, and English is available. Tools include a multilingual text editor, SGML editor, text segmenter, morpho-lexical tools, multilingual text aligner, and speech tools. Free.
MULTEXT-East
Alignment corpus of George Orwell's English novel 1984 with Eastern European languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. Free.
ECI Multilingual Corpus
Aligned Turkish-English Corpora at Bilkent University
Label France Magazine
Magazine published by the French Ministry of Foreign Affairs, available in French, English, Spanish, Portuguese, Japanese, Russian, German, and Italian. (link courtesy of Sarah Schwarm)
Floresta Sinta(c)tica Portuguese Treebank

Tools

MULTEXT
A series of projects for constructing multilingual corpora and tools. An alignment of the Official Journal of the European Community in German, Italian, Spanish, French, and English is available. Tools include a multilingual text editor, SGML editor, text segmenter, morpho-lexical tools, multilingual text aligner, and speech tools. Free.
TreeTagger
A language-independent decision-tree tagger for POS and lemma, developed by Helmut Schmid (Univ. of Stuttgart). Parameter files are available for German, English, French, Italian, Greek, and Old French, and the tagger is easily adaptable to new languages given a lexicon and a manually tagged training corpus.
Xerox Morphology Analyzer for Arabic, Dutch, English, French, etc.
Only demos are available.
Giza++
ACOPOST
Four open source POS taggers (in C): maximum entropy tagger, transformation-based tagger, example-based tagger, and HMM-based tagger. Trained taggers for Italian (trained on Italian national daily news) are available from a separate site.
fnTBL
Free POS tagger based on the Brill tagger. Also supports chunking, end-of-sentence (EOS) detection, and word sense disambiguation. Developed by Florian and Ngai at JHU.
Brill Tagger
POS tagger, Prepositional Phrase Attachment Program, and Unsupervised POS Tagger
FreeLing
Free tools developed at UPC. Includes morphological analysis, NE detection, date/number recognition, POS tagging, and interface to EuroWordNet Top Ontologies. Morphological dictionaries for English (from WSJ), Spanish, and Catalan are also included.

Mailing Lists

Corpora
Main list for corpus-based linguistics. Info about text corpora availability, compiling and using corpora, software, tagging, parsing, bibliography.
Subscribe on the web, or email MAJORDOMO@UIB.NO with "subscribe corpora" in the message body. The searchable archive is also a great source of information.
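
The email subscription step can be sketched as follows. This is only an illustration using Python's standard library: the helper name build_subscribe_message and the sender address are hypothetical, the message is built but not sent, and whether the list server still accepts these commands is an assumption.

```python
from email.message import EmailMessage


def build_subscribe_message(sender: str) -> EmailMessage:
    """Build (but do not send) a subscription request for the Corpora list."""
    msg = EmailMessage()
    msg["From"] = sender               # hypothetical subscriber address
    msg["To"] = "MAJORDOMO@UIB.NO"     # list server address given above
    # The command goes in the message body, as described above.
    msg.set_content("subscribe corpora")
    return msg


message = build_subscribe_message("you@example.org")
print(message)
```

Sending the message would additionally need a mail client or an SMTP server, which is left out here.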

Papers

  1. Woosung Kim and Sanjeev Khudanpur (JHU-CLSP): Various papers on cross-lingual language modeling.
  2. Hideharu Nakajima, Hirofumi Yamamoto, Taro Watanabe: Language Model Adaptation with Additional Text Generated by Machine Translation. COLING 2002.
  3. D. Yarowsky, G. Ngai and R. Wicentowski: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In Proceedings of HLT 2001, First International Conference on Human Language Technology Research (ISBN: 1-55860-786-2), 2001.
  4. Franz Josef Och: An Efficient Method for Determining Bilingual Word Classes. pp. 71-76, Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL'99), Bergen, Norway, June 1999.

NLP Resource Lists

Other Stuff on the Web


Maintained by Kevin Duh and Katrin Kirchhoff
Questions or suggestions? Please email: duh (at) ee.washington.edu
http://ssli.ee.washington.edu/people/duh/multilingual/
Last updated: May 01, 2004