A series of projects for constructing multilingual corpora and tools. German, Italian,
Spanish, French, and English alignment of the Official Journal of
the European Community available. Tools include a multilingual text editor,
SGML editor, text segmenter, morpho-lexical tools, multilingual text
aligner, and speech tools. Free.
Alignment corpus of George Orwell's English novel 1984 with
Eastern European languages: Bulgarian, Czech, Estonian, Hungarian,
Romanian, and Slovene. Free.
Magazine published by the French Ministry of Foreign Affairs,
available in French, English, Spanish, Portuguese, Japanese, Russian,
German, and Italian. (link courtesy of Sarah Schwarm)
A language-independent decision-tree tagger for POS and lemma
developed by Helmut Schmid (Univ. of Stuttgart). Parameter files for
German, English, French, Italian, Greek, and Old French are available, but
also easily adaptable to new languages given a lexicon and manually
tagged training corpus.
Four open source POS taggers (in C): maximum entropy tagger,
transformation-based tagger, example-based tagger, and HMM-based
tagger. Here is another site containing trained taggers for Italian
(trained on Italian national daily news).
Free tools developed at UPC. Includes morphological analysis, NE
detection, date/number recognition, POS tagging, and interface to
EuroWordNet Top Ontologies. Morphological
dictionaries for English (from WSJ), Spanish, and Catalan are also included.
Main list for corpus-based linguistics. Info about text corpora
availability, compiling and using corpora, software, tagging, parsing, bibliography.
Subscribe on the web or email MAJORDOMO@UIB.NO with "subscribe corpora" in
the message body. The searchable archive is also a great source of information.
Hideharu Nakajima, Hirofumi Yamamoto, Taro Watanabe: Language
Model Adaptation with Additional Text Generated by Machine
Translation. COLING 2002 link
Yarowsky, D., G. Ngai and R. Wicentowski, "Inducing Multilingual
Text Analysis Tools via Robust Projection across Aligned Corpora." In
Proceedings of HLT 2001, First International Conference on Human
Language Technology Research (ISBN: 1-55860-786-2), 2001. link to pdf
Franz Josef Och: An Efficient Method for Determining Bilingual
Word Classes. pp. 71-76, Ninth Conf. of the Europ. Chapter of the
Association for Computational Linguistics; EACL'99, Bergen, Norway,
June 1999. link to ps
Maintained by Kevin Duh and Katrin Kirchhoff
Questions or suggestions? Please email: duh (at) ee.washington.edu
http://ssli.ee.washington.edu/people/duh/multilingual/
Last updated: May 01, 2004