================================================================================
From Kevin Duh:

Here are some comments on today's interesting paper:

1. I wonder how inaccuracies in image segmentation, and in the vector quantization of segments into blobs/visterms, affect the downstream MT alignment. Is there a way to measure this effect experimentally?

2. Can we somehow use the word annotations in the training images to inform segmentation and vector quantization? For example, if we already know that an image is labelled "tiger", "grass", and "tree", can we use that to tell the segmentation algorithm to generate only 3 segments? Basically, incorporate the words as features or parameters when developing or training the segmentation algorithm. (Maybe this has been done already?)

3. The current paper treats the word annotation as a bag of words. Similarly, the blobs are a bag of blobs. In the meeting, we talked about how the JHU Workshop 2004 folks are putting structure into the blobs using Dynamic Bayesian networks, etc. That seems promising. But how about the structure of the word annotations? I guess this partly depends on the actual database that's used and the annotation strategy, but in general we can think of several types of structure in the annotation: 1) some words may co-occur more often than others; 2) some words may have hierarchical relationships, such as "sea" and "waves"; 3) what's more, depending on the annotation scheme, there may even be explicit structures that make clear the spatial relationships in the image, such as "cat on table" vs. "cat under table". How can these things be modeled and incorporated in the object recognition system? I imagine this is a place where stronger NLP can help.

4. The paper basically approaches the problem by saying that MT and object recognition can be seen as *similar* problems. However, it is just as interesting to ponder how standard MT and object recognition using MT are *different* from each other.
Once we know the differences, we may be better able to see which other advances in MT we can apply to object recognition with MT. For example, the two are definitely optimizing for different things: MT wants fluent and coherent translations; object recognition wants accurate alignments of words and blobs. This may be why the authors didn't go beyond IBM Model 2. But perhaps we can use hierarchical syntax-based MT models to incorporate the spatial and resolution relationships among blobs. This might model the structure better than the grid-like Markov Random Field structure used in the JHU Workshop 2004.

================================================================================
Other random comments

- What happens when the number of words gets large, as it is in MT?
- Do they need a BLEU score metric to really score the quality of their retrieval, since some of the words recalled were better than the annotation? Jeremy suggested they use "verb" for their metric.
- How does this approach compare to other info retrieval systems?
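
To make the word/blob alignment idea from comment 4 concrete, here is a minimal sketch of IBM Model 1-style EM that estimates t(word | blob) from images treated as bags of blobs and bags of words. It is an illustration only, not the paper's implementation: the toy data in PAIRS, the blob names, and the helpers train_model1 and best_word are all made up for this example.

```python
from collections import defaultdict

# Hypothetical toy corpus: each image is (bag of blob tokens, bag of annotation words).
PAIRS = [
    (["b_orange", "b_green"], ["tiger", "grass"]),
    (["b_green", "b_blue"], ["grass", "sky"]),
    (["b_orange", "b_blue"], ["tiger", "sky"]),
]


def train_model1(pairs, iterations=20):
    """Estimate t(word | blob) with IBM Model 1-style EM (no position model)."""
    vocab = {w for _, words in pairs for w in words}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, blob) alignment counts
        total = defaultdict(float)  # expected alignment counts per blob
        # E-step: spread each word over the blobs in the same image.
        for blobs, words in pairs:
            for w in words:
                norm = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    frac = t[(w, b)] / norm
                    count[(w, b)] += frac
                    total[b] += frac
        # M-step: renormalise so that t(. | blob) sums to 1 for each blob.
        t = defaultdict(float, {(w, b): c / total[b] for (w, b), c in count.items()})
    return t


def best_word(t, blob, vocab):
    """Return the word with the highest t(word | blob)."""
    return max(vocab, key=lambda w: t[(w, blob)])


if __name__ == "__main__":
    table = train_model1(PAIRS)
    for blob in ["b_orange", "b_green", "b_blue"]:
        print(blob, "->", best_word(table, blob, ["tiger", "grass", "sky"]))
```

One way to probe comment 1 with a sketch like this would be to corrupt some of the blob tokens on purpose (simulating quantization errors) and watch how the learned table degrades, though that only approximates real segmentation noise.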