Do you remember the concept of Computer Blindness? It is about people intuitively expecting results from computer vision algorithms that are unreachable by state-of-the-art methods. Believe it or not, I recently fell for that trick too.
I was looking through Navneet Dalal's slides on Histograms of Oriented Gradients. They contained a lot of frames captured from movies as examples, among them the following one:
It seemed familiar to me. I endeavoured to remember the movie, but I failed.1 So I decided to check out some web-sites that offer reverse image retrieval.
Sure, first I turned to St. Google. Its similar-image service disappointed me, since it was unable to find anything similar to my image. In fact, it has an indexed base (not a very large one), and one can find similar images only within that base. If one somehow finds an image from the base (e.g. using a text query), (s)he is shown a button that allows similar-image retrieval.
There are other web-sites for this purpose. TinEye positions itself as a reverse image search engine. However, it failed to find anything; it honestly admitted that nothing similar was found. GazoPa found something, but it was different from what I had expected, though the found images were similar in saturation. That was strange, because such methods usually work on greyscale images to be robust to colour-level shifts.
Then I decided to check whether this situation is common, and tried a different image, which I found on my desktop:2
I wanted to find the name of its author and the title3. The result was almost the same: TinEye found nothing, while GazoPa found a lot of pictures of different girls.
Why is web-scale CBIR not possible to date? Simply because there are so many images on the web that search in such a large index is intractable. The common workflow consists of feature extraction followed by feature matching. Hundreds of features can be extracted from each image, and each feature is a high-dimensional vector (e.g. 128D). Suppose we have N images in the index. If we extract 100 features from each (which is a fair lower bound), then to handle a given picture we have to match 100 query features against 100 · N index features, i.e. perform 100 * 100 * N distance computations in 128D space. That is hard to do instantly even if N is small. Moreover, in 128D indexing structures like kd-trees do not improve performance over exhaustive search, because the branch-and-bound pruning is unable to reduce the search space in practice (approximate methods partly solve the problem). For example, it took 13 hours on 496 processor cores to match 150,000 photos for the Building Rome in a Day project. This also tells us that there are 150K photos of Rome on the web, so try to imagine the total number of pictures!
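To get a feel for the cost of exhaustive matching, here is a toy sketch in NumPy. All sizes are illustrative (random descriptors, a tiny index of N = 1,000 images); a real engine would use actual SIFT-like descriptors and a web-scale index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 100 descriptors per image, 128 dimensions, N = 1,000 indexed
# images. All numbers here are illustrative, not from a real system.
D, PER_IMAGE, N = 128, 100, 1_000

query = rng.standard_normal((PER_IMAGE, D))      # features of the query image
index = rng.standard_normal((PER_IMAGE * N, D))  # features of all indexed images

# Exhaustive matching: every query feature is compared against the whole
# index -- 100 * (100 * N) distance computations in 128D space.
nearest = np.empty(PER_IMAGE, dtype=np.int64)
for i, q in enumerate(query):
    d2 = ((index - q) ** 2).sum(axis=1)  # squared distances to all index features
    nearest[i] = d2.argmin()             # best-matching index feature
```

Even this small example performs ten million 128D distance computations; scaling N to billions makes the brute-force approach hopeless.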
But there is certain hope for success in web-scale CBIR. First, when we deal with billions of photos (and trillions of features), kd-trees are likely to give sublinear performance. Second, one could use the web context of an image to weed out some obviously wrong matches.
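The approximate kd-tree queries mentioned above can be sketched with SciPy: the `eps` parameter of `cKDTree.query` trades exactness for speed by letting branch-and-bound prune much more aggressively. Again, the data here is random and the sizes are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

# A toy index of 100,000 random 128D descriptors (illustrative sizes only).
index = rng.standard_normal((100_000, 128))
tree = cKDTree(index)

q = rng.standard_normal(128)
# With eps > 0 the query is approximate: the returned neighbour is
# guaranteed to be no farther than (1 + eps) times the distance to the
# true nearest neighbour, which allows far more subtrees to be pruned.
dist, idx = tree.query(q, k=1, eps=2.0)
```

With `eps=0` the same call performs an exact search, which in 128D degenerates to nearly exhaustive scanning, as noted above.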
In the long run, we need to develop more descriptive image representations and learn how to combine the textual context with the visual content. Using a user-interest prior could also help: the engine may start the search from the most popular photos and ignore the least popular ones. The task could thus be formulated as a sequential test, where the predictor should be able to say whether the found match is good enough, or whether it should continue searching.
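The sequential-test idea could look roughly like this. Everything in the sketch is hypothetical: the popularity scores, the one-descriptor-per-image index, and the cosine-similarity threshold standing in for the "good enough" predictor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical index: one popularity score and one 128D unit descriptor
# per image (a real system would store many descriptors per image).
popularity = rng.random(10_000)
descriptors = rng.standard_normal((10_000, 128))
descriptors /= np.linalg.norm(descriptors, axis=1, keepdims=True)

def sequential_search(query, threshold=0.9):
    """Scan candidates from most to least popular; stop early on a good match."""
    q = query / np.linalg.norm(query)
    order = np.argsort(-popularity)      # most popular images first
    for rank, i in enumerate(order, start=1):
        score = q @ descriptors[i]       # cosine similarity as match quality
        if score >= threshold:           # "good enough" -- stop the search
            return i, rank
    return None, len(order)              # exhausted the index without a match
```

Searching for an image that is actually in the index terminates as soon as the popularity-ordered scan reaches it, so popular images are found almost immediately.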
UPD (Apr 7, 2010). Here is a rather broad list of visual image search engines.
1 My first thought was Roman Polanski's "The Pianist", which is surely wrong. If you recognize the movie, please let me know!
2 I saw the picture in the Hermitage Museum and downloaded it after returning from St. Petersburg, because the girl had reminded me of my friend Sveta.