Last updated $Date: 2003/12/09 19:56:59 $
--------------------------------------------
Notes compiled by Katrin Kirchhoff

I. Niessen & Ney, "Morpho-syntactic analysis for reordering in
   statistical machine translation"

- work motivated by language pair: German - English
- problem:
    differences in word order (fairly strict in English, freer in German)
    long-distance dependencies
- previous approaches:
    different alignment models (Och et al.)
    models incorporating phrase structure (Wang & Waibel, Och et al.)
- here: preprocess input to make word orders in two languages more similar
- input word string is transformed into string with more uniform word order
- two phenomena are addressed: one syntactic, one morpho-syntactic
  (requires word knowledge)
- first: question inversion
    in both English and German, word order in questions is inverted
    compared to word order of declarative sentences

      You are crazy.          Du bist verrueckt.
      Are you crazy?          Bist du verrueckt?
      You believe me.         Du glaubst mir.
      Do you believe me?      Glaubst du mir?

- second: separated verbal prefixes
    class of German verbs that consists of verb stem and prefix
    in infinitival form: stem and prefix form one word
    in other syntactic constructions requiring finite verb forms:
    prefix separated from stem
    e.g. abfliegen (leave by plane, take off)

      Ich muss morgen abfliegen.
      Ich fliege morgen (..long ADVP...) ab.

- how to do transformations on input word strings:
  1. commercial parsers applied to German and English texts to obtain
     morpho-syntactic annotation
     includes part-of-speech, secondary grammatical categories
     (tense, person, number, case, etc.)
  2. for question inversion: word order is converted to that of
     declarative sentences, i.e. subject and verb are inverted;
     in English, supporting "do" is removed
     based on constituents as defined by the parser used in step 1
     done for both English and German
  3.
     for prefixed verbs:
     extract all separable verb forms from training corpus
       -> yields list of form prefix|main
     in each sentence that contains matching main part and prefix in
     clause-final position, prefix is attached to verb
     translation model is retrained
     done for direction German - English but also for reverse direction
     English - German, which requires post-processing of German
     translation output (shift prefix into some place separate from verb)
     sometimes prefix can be put into other positions in addition to
     clause-final; depends on written vs. spoken, given-new distinction,
     formality
     use language model to decide placement of prefix

Experiments:
  Verbmobil corpus, alignment-template based statistical MT system
  (discussed previously)
  pre-BLEU work, so evaluated by SSER (subjective sentence error rate)

RESULTS:
- results improve
- improvement is larger when training data is reduced

Other questions/comments:
  potential problem: what is one word and what is verb plus additional
  adverb?

    Wir verhandeln morgen weiter.
    We negotiate tomorrow on. ( = We'll continue our negotiation tomorrow)
    Wir muessen morgen weiter verhandeln/weiterverhandeln.

  trade-off between low-entropy SMT model and amount of training data
  per word.
  in general, very detailed expert knowledge not always available for
  all language pairs
  useful to see if similar things can be done with more data-driven
  methods (shallow parsing, NP chunking, etc.)
  applicable to other languages (English noun compounds, Chinese...)

II.
Philipp Koehn and Kevin Knight, "Knowledge Sources for Word-level
Translation Models"

- impact of various available resources on word-level translation models
    word-level translation model: part of an SMT model that specifies
    number/identity of translation options for each word, along with
    probabilities ("t-tables" in IBM jargon)
    in original model, t-tables were estimated from aligned corpus only
- how well can you do when you have:
    a) only a parallel corpus
    b) parallel corpus and bilingual lexicon
    c) parallel corpus, monolingual corpora and bilingual lexicon
    d) monolingual corpora and bilingual lexicon
    e) only monolingual corpora in the source and target languages
- reason: parallel corpora are difficult to create, not available for a
  wide range of language pairs
- dictionaries and monolingual corpora are much easier to obtain (WWW)
- language pair here: German-English
- evaluated above approaches based on parallel reference corpus:
  percentage of word pairs in output corresponding to word pairs in
  reference labeling - only nouns!

a) determine most likely word alignments in training parallel corpus
   using standard SMT model
   get word-level translation probabilities from relative frequency
   counts in alignments: 76.9% correct
   word correspondences are not found:
     too few training samples in corpus (< 50)
     lack of constraints in alignment allows for noise
   if German word is not found in training corpus, just repeated
   verbatim in English output
     -> correct in 27% of all cases! (German uses many English words)
   found one correspondence that was not in dictionary although
   perfectly good translation

b) use lexicon to constrain extraction of word pairs from parallel
   corpus, i.e.
   only those pairings that are listed as valid translations in the
   lexicon are retained
   and pairs not identified by statistical alignment (too infrequent)
   but by lexicon were included
   features of the target word's context are used in a classifier to
   identify the most likely option of several possible translations
   - three words of local context around target word (POS used as backoff)
   - any open-class word in same sentence
   - any open-class word in same document
   classifier: decision lists (like a set of if-then-else rules with
   associated probabilities)
   baseline: choose most frequent word translation in training data: 88.9%
   with word sense disambiguation: 89.5%

c) monolingual corpus is used to improve decision list approach
   apply decision list to German corpus and label more occurrences of
   German words with English translations -> retrain decision list
   idea: cover a wider range of contexts
   only increases choice of most frequent translation

d) bootstrap SMT model just from monolingual corpus and bilingual
   lexicon instead of parallel corpus
   each word in target corpus receives all of its possible translations
   based on lexicon
   SMT model is trained, corpus is re-aligned, process is iterated
   EM-style training procedure, similar to training unsupervised tagger
   result: 79.0%

e) only monolingual corpora: unsupervised acquisition of a bilingual
   lexicon
   - use a seed lexicon and data-driven methods of identifying possible
     translations
   - seed lexicon: words that have same form in English and German
     (11.9% accuracy)
   - context/frequency of cooccurrence - words co-occurring frequently
     have translations that cooccur frequently
   - similar spelling due to same roots (mother - Mutter)
   - relationship to other words (dog vs cat - Hund vs.
     Katze), identified by same context
   - words occurring frequently have translations that occur frequently
   - similar spelling: 25.4% accuracy
   - context: 31.9%
   - both: 38.6%
   (only measured on 1000 most frequent words)

Summary:
  (a) parallel corpus: 76.9%
  (b) 88.9%; with word-sense disambiguation: 89.5%
  (c) ??? (not good)
  (d) 79.0%
  (e) not comparable, but interesting

--------------------------------------------
From Jeremy Kahn:

> The commander Forbin of janson, being at a repast with a celebrated
> Boileau, had undertaken to pun him upon her name:--"What name," told
> him, "carry you thither? Boileau: I would wish better to call me
> Drink wine." The poet was answered him in the same tune:--"And you,
> sir, what name have you choice? Janson: I should prefer to be named
> John-Meal. The meal don't is valuable better than the furfur?"

Didn't make sense? Well, don't worry.
http://crossroads.net/honyaku/easis/ is where it came from: an 1883
traveler's guide to English, written by two Portuguese translators who
didn't speak English and apparently had only a Portuguese-French and a
French-English translation dictionary to work with. (Whether they spoke
French I haven't found out). The Village Voice has an old review of this
book at http://www.villagevoice.com/issues/0232/kite.php.

I find it absolutely hilarious, especially the "Familiar phrases"
section:

> Dry this wine.
> He laughs at my nose, he jest by me.
> He has spit in my coat.
> He has me take out my hairs.
> He does me some kicks.
> He has scratch the face with hers nails.
> He burns one's self the brains.
> He is valuable his weight's gold.

These guys obviously have no language models for English to prune their
hypotheses, but beyond that -- what kind of conversations were they
expecting people to have?
--------------------------------------------
From Emily Bender:

A few quick notes to add to those already posted for 12/8,
on Niessen & Ney:

-- The issues that arise with German separable-prefix verbs also arise
   to some extent with English "phrasal verbs", i.e., verbs that
   necessarily appear with a preposition-like particle. These can be
   identified (if they are transitive) because the particle can appear
   before or after the object:

   (1) I looked up the answer.
   (2) I looked the answer up.

   Many (most?) English phrasal verbs are non-compositional in their
   semantics, that is, the meaning of "look up" in (1) and (2) is not
   entirely predictable on the basis of the meanings of look and up in
   isolation, and they certainly will often translate to single words
   in other languages which are not related to the verb part of the
   phrasal verb. Thus to the extent that an MT system could know when
   it was dealing with a phrasal verb and when it was dealing with a
   superficially similar verb plus preposition combination (3),
   alignment and word translation would be improved.

   (3) I looked up into the sky.

-- The Constraint Grammar parser appears to be a combination of a
   morphological analyzer, POS tagger, and syntactic function tagger.
   That is, it's not returning syntactic structures, but rather
   syntactic dependency tags (e.g., subject, premodifying noun), and
   much ambiguity is avoided by only providing single tags in cases of
   systematic ambiguity (e.g., adjective v. noun uses of 'brown').

   The original version of the English parser is described here:
   http://www.ling.helsinki.fi/~tapanain/cg/engcg.txt

   A thoroughly revised and updated version is described here:
   http://www.ling.helsinki.fi/~tapanain/dg/doc/index.html

And on Koehn & Knight:

-- The similar spelling heuristics for acquiring a bilingual lexicon
   from monolingual texts might also be somewhat effective for Chinese
   and Japanese, because of the massive borrowing of Chinese words
   (with their kanji) into Japanese.
   While there are of course prominent examples where the meanings
   have drifted (I think that the Japanese word for letter is supposed
   to translate to toilet paper in Chinese), there are also many cases
   where they are the same. Any attempt to apply the heuristic would
   have to account for systematic differences in the way the
   characters themselves are written. In general, the Japanese system
   is closer to the unsimplified (Taiwan) version, but it is simplified
   in some respects, and not always in the same way as the characters
   have been simplified in Mainland China.
--------------------------------------------
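
The similar-spelling clue from the Koehn & Knight notes (mother - Mutter)
can be illustrated with a small sketch. The notes do not say which string
similarity measure the paper uses, so the longest-common-subsequence
ratio below is an assumption chosen for illustration, and all function
names are hypothetical. As written it compares alphabetic strings only;
applying it to Chinese and Japanese would first need the
character-variant normalization discussed above, which is not attempted
here.

```python
from typing import Iterable


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the prefix of a processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def spelling_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer word's length (assumed measure)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def best_match(word: str, candidates: Iterable[str]) -> str:
    """Pick the candidate translation whose spelling is closest to word."""
    return max(candidates, key=lambda c: spelling_similarity(word, c))
```

For example, spelling_similarity("Mutter", "mother") is 4/6, while
spelling_similarity("Hund", "dog") is only 1/4, which is the kind of pair
where the context clues would have to take over.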