$Date: 2004/02/13 00:17:53 $
I was disappointed in the limitation to weights of 0/1 and symmetry. String alignment with crossover could be looked at as a proper translation model.
Also, I'd like to see a version where the substitution weights are dependent on the substitution -- is there a paper following up on this one that does that? Furthermore, if the inversion weights could be conditioned on the parse-ability of the inverted chunks, we might have a real model for translation. Of course, it wouldn't be a "distance" measure, but an English->French divergence measure. I think there's real promise there though.
Of relevance to our research, then -- can we apply this divergence to N strings where N > 2? If we can, then we have a possible way to integrate with multi-language translation models.
This cost measure also implicitly defines an alignment. BLEU does not (I think). In some respects, this is preferable, because the alignments would be useful in (1) identifying problems and (2) improving translations, if we used some of the divergences described above.
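To make the substitution-dependent weights suggested above concrete, here is a rough Python sketch. It is plain weighted edit distance only: it leaves out the paper's inversion operation, and the names weighted_edit_distance and toy_sub_cost, as well as the cost values, are my own illustrative assumptions. Note that once sub_cost(a, b) differs from sub_cost(b, a), or insertion and deletion costs differ, the result is a one-directional divergence rather than a distance.

```python
from typing import Callable, Sequence


def weighted_edit_distance(
    source: Sequence[str],
    target: Sequence[str],
    sub_cost: Callable[[str, str], float],
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
) -> float:
    """Edit cost where the substitution price depends on the token pair.

    Hypothetical helper for illustration; no inversion/block-move operation.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_cost  # delete every source token so far
    for j in range(1, cols):
        dist[0][j] = j * ins_cost  # insert every target token so far
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


def toy_sub_cost(a: str, b: str) -> float:
    """Made-up costs: free for identical tokens, cheap for case-only changes."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.25
    return 1.0


if __name__ == "__main__":
    hyp = "the cat sat".split()
    ref = "The cat sat".split()
    print(weighted_edit_distance(hyp, ref, toy_sub_cost))
```

A learned table of costs could be dropped in for toy_sub_cost without changing the DP itself.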
Regarding the DP approach to string alignment -- there are beam/A* search techniques that reduce the amount of search substantially, more or less by exploring along the diagonal, e.g.:
@misc{ kobayashi-improvement, author = "Hirotada Kobayashi and Hiroshi Imai", title = "Improvement of the A* Algorithm for Multiple Sequence Alignment", url = "citeseer.nj.nec.com/498639.html" }
It seems like these could be extended for this kind of search as well, at least for short sequences like sentences.
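As a rough illustration of "exploring along the diagonal", here is a banded edit distance in Python. This is the simplest band restriction I know of, not the A* method from the Kobayashi and Imai paper, and it uses unit costs with no inversions. The function name and the band parameter are my own assumptions. Cells farther than band from the diagonal are never filled in, so the work is roughly O(n * band) instead of O(n * m); the result is exact whenever the true distance is at most band, and otherwise it is only an upper bound (or infinity).

```python
import math
from typing import Sequence


def banded_edit_distance(a: Sequence, b: Sequence, band: int) -> float:
    """Unit-cost edit distance restricted to cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return math.inf  # the final cell lies outside the band
    # Row 0: cost of inserting the first j tokens of b, within the band.
    prev = {j: float(j) for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        curr = {}
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = float(i)  # delete the first i tokens of a
                continue
            mismatch = 0.0 if a[i - 1] == b[j - 1] else 1.0
            best = prev.get(j - 1, math.inf) + mismatch         # match/substitute
            best = min(best, prev.get(j, math.inf) + 1.0)       # delete a[i-1]
            best = min(best, curr.get(j - 1, math.inf) + 1.0)   # insert b[j-1]
            curr[j] = best
        prev = curr
    return prev.get(m, math.inf)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the big cat sat on a mat".split()
    print(banded_edit_distance(hyp, ref, band=2))
```

For sentence-length inputs a small band is probably enough in the monotone case; handling inversions would need the band idea to be rethought, since an inverted block can sit far off the diagonal.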
We had some interesting discussions today! Anyway, here are some of my notes (many of them aren't my original ideas, but are things I found interesting in the discussion):
1. I wonder where else we could apply such a string-to-string distance measure. As someone mentioned, there's no reason to restrict the measure to comparing strings in the same language for MT evaluation purposes. Why not use the measure directly as a translation system? The output of the machine translator would be the parse tree (and subsequent translated string) that achieves the lowest cost.
2. Jeff mentioned that this idea is similar to how dynamic programming was used in speech recognition in the old days. But now it has been supplanted by HMMs with learned parameters. So there's no reason not to learn these costs from training data. For example, to achieve MT evaluation similar to humans, why not provide a set of translated and reference sentences paired with human-evaluated scores? Then the costs in this distance measure (cost of inversion, substitution, etc.) can be learned with EM. It seems that to make evaluation systems closer to human evaluators, we must put some human experience into them!
3. The paper essentially uses the correlation between automatic and human evaluation as the measure of success for an evaluation system. Specifically, the paper compared correlations at the sentence level and the system level, as well as correlations of rankings (in Table 2 they compared the rankings of sentences). Is this the best way to do it? Is it meaningful? I think this issue opens a whole range of important meta-questions: How do you systematically evaluate an evaluation system? An answer to this question is essential. After all, if we evaluate evaluation systems with some bias, and then everyone in the world tries to optimize for that biased evaluation measure, then we get very bad results!
4. On a similar note to (3), I wonder why we evaluate MT systems with raw numeric scores (i.e., BLEU scores). Personally, I think numeric scores skew our understanding of real MT system performance. What really matters to a human reader is whether the machine-translated sentences are 1) understandable, and 2) elegant/fluent. The human evaluation process is by nature categorical. Of course, we can assign rough scores to represent different qualities of sentences, but even such scores will vary widely among humans. So instead of designing distance measures that give numeric evaluation scores, why not design a categorical classifier system? The input to the classifier will be the translated and reference sentences. The output of the classifier will just be one of: 1) understandable, 2) fluent, 3) nonsense. After all, that's how humans will respond to translations. Furthermore, this classifier could be trained with human evaluation data too. So the problem of evaluating MT systems can be framed as a problem of designing the best classifier.
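To make note 3 concrete: the sentence-level comparison boils down to a correlation between metric scores and human scores, and the ranking comparison to a rank correlation. A rough Python sketch follows. The helper names are my own, the numbers are made-up placeholders (not data from the paper), and I am assuming Pearson and Spearman coefficients here; the paper may define its correlations differently.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Placeholder scores for five sentences, invented for illustration only.
    metric_scores = [0.12, 0.40, 0.33, 0.75, 0.58]
    human_scores = [1.0, 3.0, 2.0, 5.0, 3.0]
    print("sentence-level Pearson:", pearson(metric_scores, human_scores))
    print("rank (Spearman):", spearman(metric_scores, human_scores))
```

The same two functions could be applied at the system level by feeding in one averaged score per system instead of one score per sentence.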
Ok, that's it for now. This was a fun, thought-provoking paper. :)
ps. I'm wondering if we can use a wiki as our MTRG webpage. That way, comments/notes can be posted at any time by anyone easily. It seems to be the easiest way to share ideas. Also, you wouldn't need to spend valuable time maintaining the website by yourself all the time. It might also give people more of a sense of ownership, so they'll post their thoughts more often. I'm using a wiki now to share thoughts on research with friends at other universities, and it's been pretty useful: http://www.seedwiki.com/page.cfm?wikiid=3740&doc=ResearchDiary (this is a free wiki farm. If we do have a wiki, I'd imagine we'd want our own wiki server so that we can set permissions on who can edit pages, etc.) Anyway, just an idea.
(response from moderator, we want to maintain some form of moderation:).