
Team 6: Visual words: Text analysis concepts for computer vision

August 5, 2009
  • Computational issues
  • 65D19
[Figure: Illustration of SIFT features computed using the VLFeat library]

Project Description:

Large collections of image and video data are becoming increasingly common in a diverse range of applications, including consumer multimedia (e.g., Flickr and YouTube), satellite imaging, video surveillance, and medical imaging. One of the most significant problems in exploiting such collections is the retrieval of useful content, since the collections are often large enough to make a manual search impossible. These problems are addressed in computer vision research areas such as content-based image retrieval, automatic image tagging, semantic video indexing, and object detection. A sample of the exciting work being done in these areas can be obtained by visiting the websites of leading research groups such as Caltech Computational Vision, Carnegie Mellon Advanced Multimedia Processing Lab, LEAR, MIT CSAIL Vision Research, Oxford Visual Geometry Group, and WILLOW.

One of the most promising ideas in this area is that of visual words, constructed by quantizing invariant image features such as those generated by SIFT. These visual word representations allow text document analysis techniques (Latent Semantic Analysis, for example) to be applied to computer vision problems. An interesting example is the use of Probabilistic Latent Semantic Analysis or Latent Dirichlet Allocation to learn to recognize categories of objects (e.g., car, person, tree) within an image. The training set for this task is labeled only to indicate the object categories present in each image, with no indication of where each object is located in the image.

In this project we will explore the concept of visual words, understand their properties and relationship with text words, and consider interesting extensions and new applications.

References:

[1] Lowe, David G., Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004. doi: 10.1023/b:visi.0000029664.99615.94
[2] Leung, Thomas K. and Malik, Jitendra, Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons, International Journal of Computer Vision, vol. 43, no. 1, pp. 29-44, 2001. doi: 10.1023/a:1011126920638
[3] Liu, David and Chen, Tsuhan, DISCOV: A Framework for Discovering Objects in Video, IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 200-208, 2008. doi: 10.1109/tmm.2007.911781
[4] Fergus, Rob, Perona, Pietro and Zisserman, Andrew, Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition, International Journal of Computer Vision, vol. 71, no. 3, pp. 273-303, 2007. doi: 10.1007/s11263-006-8707-x
[5] Philbin, James, Chum, Ondřej, Isard, Michael, Sivic, Josef and Zisserman, Andrew, Object retrieval with large vocabularies and fast spatial matching, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007. doi: 10.1109/CVPR.2007.383172
[6] Yang, Jun, Jiang, Yu-Gang, Hauptmann, Alexander G. and Ngo, Chong-Wah, Evaluating bag-of-visual-words representations in scene classification, Proceedings of the international workshop on multimedia information retrieval (MIR '07), pp. 197-206, 2007. doi: 10.1145/1290082.1290111
[7] Yuan, Junsong, Wu, Ying and Yang, Ming, Discovery of Collocation Patterns: from Visual Words to Visual Phrases, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007. doi: 10.1109/cvpr.2007.383222
[8] Fergus, Rob, Fei-Fei, Li, Perona, Pietro and Zisserman, Andrew, Learning object categories from Google's image search, IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 1816-1823, 2005. doi: 10.1109/iccv.2005.142
[9] Quelhas, P., Monay, F., Odobez, J.-M., Gatica-Perez, D., Tuytelaars, T. and Van Gool, L., Modeling scenes with local descriptors and latent aspects, IEEE International Conference on Computer Vision (ICCV), pp. 883-890, 2005. doi: 10.1109/iccv.2005.152
[10] Sivic, Josef, Russell, Bryan C., Efros, Alexei A., Zisserman, Andrew and Freeman, William T., Discovering objects and their location in images, IEEE International Conference on Computer Vision (ICCV), pp. 370-377, 2005. doi: 10.1109/iccv.2005.77

Prerequisites:

A strong computational background is essential, preferably with significant experience in Matlab programming. (While experience with other programming languages such as C, C++, or Python may be useful, Matlab is likely to be the common language when individual team members' contributions need to be integrated into a joint code base.) Some background in areas such as image/signal processing, optimization theory, or statistical inference would be highly beneficial.
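
As a rough illustration of the visual-word idea in the project description, the sketch below quantizes feature vectors into a small vocabulary and then represents each image as a histogram of word counts, much as a text document is represented by its word frequencies. It rests on several assumptions that are not part of the project description: plain k-means is used as the quantizer, short synthetic vectors stand in for real SIFT descriptors, and all function names (`build_vocabulary`, `bag_of_words`, `synthetic_image`) are hypothetical. Python is used only because it needs no toolboxes; the same steps could be written in Matlab.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centre closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def build_vocabulary(descriptors, num_words, iterations=20, seed=0):
    """Plain k-means: the returned cluster centres act as the visual words."""
    rng = random.Random(seed)
    # Initialise the centres with randomly chosen descriptors.
    centres = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, centres)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def bag_of_words(descriptors, vocabulary):
    """Histogram counting how often each visual word occurs in one image."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1
    return histogram


def synthetic_image(rng, prototypes, counts, noise=0.05):
    """Stand-in for feature extraction: noisy copies of a few prototypes."""
    descriptors = []
    for proto, n in zip(prototypes, counts):
        for _ in range(n):
            descriptors.append([p + rng.gauss(0, noise) for p in proto])
    return descriptors


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical 4-D "descriptor" prototypes (real SIFT vectors are 128-D).
    prototypes = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
    image_a = synthetic_image(rng, prototypes, [10, 2, 0])
    image_b = synthetic_image(rng, prototypes, [1, 3, 8])
    vocabulary = build_vocabulary(image_a + image_b, num_words=3)
    print("image A histogram:", bag_of_words(image_a, vocabulary))
    print("image B histogram:", bag_of_words(image_b, vocabulary))
```

Stacking such histograms, one row per image, gives a matrix analogous to the term-document matrix used in text analysis, which is the kind of input that techniques such as Latent Semantic Analysis operate on.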