Two questions came up while discussing this paper. 1) Since this is an automatic evaluation of automatic evaluation methods, how do we know that ORANGE is the best such meta-evaluator? Might there not be other meta-evaluators that are better? There is also the problem of meta-regress: there could be automatic evaluation of automatic evaluation of automatic evaluation methods, and so on. 2) When they evaluated the reference transcriptions under BLEU, how exactly did they do it? In other words, did they leave the reference transcription being scored out of the pool of other reference transcriptions when they computed BLEU (which would put BLEU at a disadvantage relative to the other methods)? If not, wouldn't BLEU rate that reference transcription very highly, since it is both the thing being evaluated and a member of the gold-standard set?
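
To make the second question concrete, here is a minimal sketch of the two protocols. It is not the paper's procedure and it is not full BLEU: `clipped_precision` is a hypothetical helper that computes only a BLEU-style clipped n-gram precision (no brevity penalty, no averaging over n-gram orders), and the three sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=2):
    """BLEU-style clipped n-gram precision of a candidate against a reference pool.

    Simplification (assumption): a single n-gram order, no brevity penalty.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    # For each n-gram, the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


def score_reference(index, references, leave_one_out, n=2):
    """Score references[index] as if it were a candidate.

    With leave_one_out=True the scored reference is removed from the pool;
    otherwise it is compared against a pool that still contains itself.
    """
    candidate = references[index]
    pool = [r for i, r in enumerate(references)
            if not (leave_one_out and i == index)]
    return clipped_precision(candidate, pool, n)


# Made-up example data, not taken from the paper.
refs = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
    "the cat is on the mat",
]

included = score_reference(0, refs, leave_one_out=False)
held_out = score_reference(0, refs, leave_one_out=True)
print(included, held_out)
```

In this toy setup the reference scores perfectly when it stays in the pool, because every one of its n-grams matches itself, and scores lower when it is held out. Which of the two the paper actually did is exactly what the question asks.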