Notes from our discussion:
- Why was a non-standard evaluation method used? The test set for the shared data task was split into a portion used for training and the actual evaluation set, which left a smaller evaluation set than usual. The results are therefore not comparable to previously published numbers.
- It is not clear whether OOVs were less of a problem because the model was trained on a subset of the test set.
- It is not fair to compare a model trained on some hand-labeled data with a model that uses no hand-labeled data at all: the IBM-4 model could have been trained on the hand-labeled data as well.
- The use of the bias feature was not clear.
- The features that really make the model work are the IBM-4 features. Maybe these are all that is needed and the other features are not relevant? In that case the proposed method would boil down to corrective training applied to the IBM-4 model. The evaluation should have included discriminative matching trained only with the IBM-4 features and/or the baseline IBM-4 model with features such as word length included.
- Error in the Dice coefficient equation? The denominator should normally be C_E(e) + C_F(f). Is the equation wrong, or did they really use a product?
- General unhappiness with the way word alignments are currently being evaluated. First, there is too much focus on improving word alignments when it is not clear to what extent they help the overall system. Second, the Hansard corpus in particular seems to have a disproportionately high number of unsure links. Third, many-to-one word alignments always need to be resolved arbitrarily, and it is sometimes really not possible to say which word maps to which other word within a phrase; perhaps some notion of phrase correspondence should finally be incorporated into alignment evaluations.
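To make the Dice coefficient question concrete, here is a minimal sketch comparing the two denominators on a toy corpus. The usual form is Dice(e, f) = 2 C_EF(e, f) / (C_E(e) + C_F(f)). Everything in the sketch is our own assumption for illustration: the function names, the symbol C_EF for the co-occurrence count, the choice to count each word once per sentence, and the three toy sentence pairs. It says nothing about which variant the authors actually implemented.

```python
from collections import Counter


def cooccurrence_counts(sentence_pairs):
    """Count sentence-level occurrences and co-occurrences.

    Assumption: a word is counted once per sentence (type-level counts),
    so C_EF(e, f) is the number of sentence pairs containing both e and f.
    """
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in sentence_pairs:
        e_types, f_types = set(e_sent), set(f_sent)
        c_e.update(e_types)
        c_f.update(f_types)
        c_ef.update((e, f) for e in e_types for f in f_types)
    return c_e, c_f, c_ef


def dice_sum(e, f, c_e, c_f, c_ef):
    """Standard Dice coefficient: 2 * C_EF / (C_E + C_F)."""
    denom = c_e[e] + c_f[f]
    return 2 * c_ef[(e, f)] / denom if denom else 0.0


def dice_product(e, f, c_e, c_f, c_ef):
    """Variant with a product in the denominator: 2 * C_EF / (C_E * C_F)."""
    denom = c_e[e] * c_f[f]
    return 2 * c_ef[(e, f)] / denom if denom else 0.0


# Hypothetical toy corpus, not taken from the paper or the Hansards.
pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "car"], ["la", "voiture"]),
    (["the", "house", "is", "big"], ["la", "maison", "est", "grande"]),
]
c_e, c_f, c_ef = cooccurrence_counts(pairs)

for e, f in [("the", "la"), ("car", "voiture"), ("the", "maison")]:
    print(e, f,
          round(dice_sum(e, f, c_e, c_f, c_ef), 3),
          round(dice_product(e, f, c_e, c_f, c_ef), 3))
```

On this toy data the sum form stays within [0, 1] and gives 1.0 to both pairs that always co-occur ("the"/"la" and "car"/"voiture"). The product form gives roughly 0.667 to the frequent pair and 2.0 to the pair seen once, so it is not bounded by 1 and favors rare words. If the paper really used a product, the feature would behave quite differently from the standard Dice score.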