Additional Feature Corpora

In addition to the YLI Feature Corpus, the Multimedia Commons resources include several sets of computed features contributed by researchers at collaborating institutions. These features were extracted from the images in the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The goal of sharing these feature sets is to let multimedia researchers using the YFCC100M focus on advancing science, rather than each spending time and compute power deriving the same data. The features are identified using MD5 hashes that can be used to link them with the original media and with the YFCC100M metadata.

This page provides details about each of the feature sets, including how they were computed and how they are organized on the Amazon Web Services (AWS) Public Data Sets platform.

Jump to:
VLAD-VGG-YFCC Dataset: VLAD, VGG-CNN (on Static Images)
ISTI Deep Feature Corpus: Hybrid-CNN (on Static Images)
Features and Annotation Sets Hosted on Other Sites (CMU)

For additional Multimedia Commons feature sets, see the YLI Feature Corpus page.


VLAD-VGG-YFCC Dataset

The VLAD-VGG-YFCC Dataset includes VGG (CNN) features and VLAD-1024 vectors computed on each of the static images in the YFCC100M.

General Information About the VLAD-VGG-YFCC Dataset

Separate Directory: All of the features in the VLAD-VGG-YFCC dataset may be found in the features/image/vgg-vlad-yfcc/ directory, which has subdirectories for each of the specific feature sets. (This is unlike the other feature sets, which each have a directory directly under features/image/ within the multimedia-commons S3 data store on AWS Public Data Sets. The organization of the files within the subdirectories is also different from that of the other feature sets. We hope to reorganize these files at a later time, to parallel the structure of the other features.)

Caveat — Missing Features: Because some of the media files were removed by their owners between the initial publication of the YFCC100M and when the image files were downloaded for this project, a very small percentage of the images listed in the YFCC100M metadata index do not have features in the VLAD-VGG-YFCC Dataset. (Additional features are missing specifically from the VLAD set; details below.)

License: The VLAD-VGG-YFCC feature sets are licensed by ITI-CERTH and CEA List under a Creative Commons Attribution license (CC-BY), meaning they are free for any use as long as credit is given to the producers (see license for details). (Use of the original YFCC100M metadata and media is subject to the Creative Commons licenses chosen by the uploaders, which may have different specifics.)

Documentation/Preferred Citation: Adrian Popescu, Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Hervé Le Borgne, and Yiannis Kompatsiaris. Towards an Automatic Evaluation of Retrieval Performance With Large Scale Image Collections. In Proceedings of the ACM Multimedia 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons ’15), Brisbane, Australia, October 2015. PDF of Preprint.

Cheers To: The VLAD-VGG-YFCC Dataset is a joint effort of the Multimedia Knowledge & Social Media Analytics Laboratory at the Information Technologies Institute/Centre for Research & Technology and of CEA List (part of L’Institut de CEA Tech). The VLAD features were wrangled by Symeon Papadopoulos, Eleftherios Spyromitros-Xioufis, and Katerina Andreadou at ITI-CERTH and the VGG features were wrangled by Adrian Popescu at CEA List.

Funding: Work on the VLAD-VGG-YFCC Dataset is funded via the SocialSensor Project (European Commission Grant Agreement #287975), the USEMP Project (User Empowerment for enhanced online management) (European Commission Grant Agreement #611596), and the REVEAL Social Media Verification Project (European Commission Grant Agreement #610928). (Any opinions, findings, and conclusions expressed on this website are those of the individual researchers and do not necessarily reflect the views of the funders.)

Contact: Questions about the VLAD features may be directed to Symeon Papadopoulos at papadop[chez]iti[stop]gr or Eleftherios Spyromitros-Xioufis at espyromi[chez]iti[stop]gr. Questions about the VGG features may be directed to Adrian Popescu at adrian[stop]popescu[chez]cea[stop]fr.

More Information/Mirror: See the VLAD-VGG-YFCC web page for more information about the VLAD-VGG-YFCC Dataset and instructions on how to get it via FTP.

VLAD (Vector of Locally Aggregated Descriptors): Static Images

The PCA-projected VLAD-1024 vectors may be found in the features/image/vgg-vlad-yfcc/vlad/ subdirectory within the multimedia-commons S3 data store on AWS Public Data Sets. This subdirectory also contains some demonstration files that show how the VLAD features can be used in a visual search index.

Image Sizes: VLAD features were computed on image versions with a maximum size of 640 pixels in the longest dimension (currently Flickr’s “medium” size).

Missing Features: Due to gaps in the initial downloads (as well as the owner-removals mentioned above), there are VLAD features for only about 97.6 million of the 99.2 million YFCC100M images. (The missing images are those for which Flickr was unable to automatically generate a new “medium” version when its sizing policy changed in 2010; in other words, the images missing specifically from the VLAD set are all older ones.)

Directory Organization: The subdirectory features/image/vgg-vlad-yfcc/vlad/ contains several items:

  • A subdirectory named full/ that contains the feature files. Within full/:
    • The feature vectors are divided into 10 subfolders (yfcc_0 through yfcc_9). Each subfolder contains 90 gzipped files (part-r-00000.gz through part-r-00089.gz). Note that this organization is not based on the MD5-hashed media identifiers.
    • sample.txt contains a small extract of VLAD vectors for 20 images.
    • A simple Java class (ParseIndex.java) is also provided to demonstrate how the parsing can be done. (Note that the input is the gzipped file; you don’t need to decompress the files.)
  • Base files to support the YFCC100M Visual Search Utility developed at ITI-CERTH:
    • yfcc100m_ivfpq.zip contains a version of the “full” index that was compressed using the Product Quantization technique.
    • learning_files.zip contains the learning files required to extract compatible full and compressed vectors from new images.
  • A README with more details about the feature files and how to use them.
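As a rough illustration of the layout described above, the following sketch lists the paths one would expect for the feature files under full/. The folder and file naming follows the description on this page; the function name and the idea of building paths relative to the data store root are our own assumptions, so check the README before relying on it.

```python
# Sketch only: builds the expected relative paths of the VLAD part files.
# The prefix and naming pattern come from this page; the helper itself is
# hypothetical and does not contact AWS.
VLAD_FULL_PREFIX = "features/image/vgg-vlad-yfcc/vlad/full/"


def vlad_part_keys(prefix=VLAD_FULL_PREFIX):
    """Return the expected part-file paths, ordered by subfolder and part."""
    return [
        f"{prefix}yfcc_{folder}/part-r-{part:05d}.gz"
        for folder in range(10)   # yfcc_0 through yfcc_9
        for part in range(90)     # part-r-00000.gz through part-r-00089.gz
    ]


if __name__ == "__main__":
    keys = vlad_part_keys()
    print(len(keys))
    print(keys[0])
    print(keys[-1])
```

Because the files are not organized by MD5-hashed identifier, finding a particular image generally means scanning these files rather than computing a path.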

File Layout for Feature Files: Each VLAD feature file (features/image/vgg-vlad-yfcc/vlad/full/yfcc_X/part-r-XXXXX.gz) consists of about 110,000 rows. Each row contains the feature vector for one image. Each row begins with the Flickr URL for the image file followed by a bracketed set of 1024 vector elements. The URL is separated from the vector elements with a tab, while the elements are comma-space separated ([vlad_1, vlad_2, ..., vlad_1024]). There are no column headers.
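The row format above can be read with a few lines of Python. This is a minimal sketch, not a port of ParseIndex.java: the function names are hypothetical, and it assumes each row is exactly a URL, a tab, and a bracketed, comma-separated vector, as described here.

```python
import gzip


def parse_vlad_row(line):
    """Split one row of the form 'URL<TAB>[v1, v2, ...]' into (url, vector)."""
    url, _, bracketed = line.rstrip("\r\n").partition("\t")
    body = bracketed.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"row for {url!r} has no bracketed vector")
    # Elements are comma-space separated; float() tolerates the leading space.
    vector = [float(value) for value in body[1:-1].split(",")]
    return url, vector


def iter_vlad_file(path):
    """Yield (url, vector) pairs directly from a gzipped part file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_vlad_row(line)
```

As with the Java example, the gzipped file is read directly, so nothing needs to be decompressed on disk. A caller that wants to be strict could also check that each vector has 1024 elements.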

Tools and Parameters: These are improved VLAD vectors using SURF descriptors and the multiple vocabulary aggregation technique with four visual vocabularies (of k=128 centroids). Their initial dimensions (32,768) were reduced using PCA (principal component analysis) whitening. VLAD-1024 retains the 1024 most significant dimensions.
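The initial dimensionality quoted above can be checked with simple arithmetic. The sketch below assumes the standard 64-dimensional SURF descriptor, which this page does not state explicitly; the other numbers are taken from the text.

```python
# Values from the text, except SURF_DIM, which is an assumption
# (the standard, non-extended SURF descriptor has 64 dimensions).
NUM_VOCABULARIES = 4
CENTROIDS_PER_VOCABULARY = 128
SURF_DIM = 64
PCA_DIM = 1024


def vlad_raw_dim(vocabularies=NUM_VOCABULARIES,
                 centroids=CENTROIDS_PER_VOCABULARY,
                 descriptor_dim=SURF_DIM):
    """Length of the concatenated VLAD vector before PCA reduction."""
    return vocabularies * centroids * descriptor_dim


if __name__ == "__main__":
    raw = vlad_raw_dim()
    print(raw)             # dimensions before PCA
    print(raw // PCA_DIM)  # reduction factor for VLAD-1024
```

Under that assumption the product matches the 32,768 initial dimensions given above, so VLAD-1024 keeps one thirty-second of them.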

VGG-CNN Features: Static Images

The VGG CNN (convolutional neural network) features may be found in the features/image/vgg-vlad-yfcc/vgg/ subdirectory within the multimedia-commons S3 data store on AWS Public Data Sets.

Directory Organization: features/image/vgg-vlad-yfcc/vgg/ contains a set of ten gzipped tar archives (yfcc100m_dataset-0.tar.gz through yfcc100m_dataset-9.tar.gz). (For best results, use gunzip and then tar -xf to extract the archive.) Within each archive, there are a thousand feature files (batch_0.txt through batch_999.txt). The numbering of the files and the ordering of the features within the files reflect the ordering of the image metadata in the YFCC100M index, so, for example, vgg/yfcc100m_dataset-0/batch_0.txt contains the VGG features for the first ten thousand images listed in the YFCC100M metadata index. Note that this organization is not based on the MD5-hashed media identifiers.
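Since the files follow the ordering of the YFCC100M index, the archive and batch file for a given image can be estimated from its position in that index. The sketch below is only a first guess: it assumes every batch covers exactly 10,000 consecutive index entries, whereas the files are described as holding around that many, so the result is best treated as a starting point and verified against the MD5 identifiers in the file. The function name is hypothetical.

```python
IMAGES_PER_BATCH = 10_000     # assumption: the text says "ten thousand" per file
BATCHES_PER_ARCHIVE = 1_000   # batch_0.txt through batch_999.txt
NUM_ARCHIVES = 10             # yfcc100m_dataset-0 through yfcc100m_dataset-9


def locate_vgg_batch(index_position):
    """Guess (archive name, member path) for a 0-based YFCC100M index position."""
    if index_position < 0:
        raise ValueError("index position must be non-negative")
    batch_global = index_position // IMAGES_PER_BATCH
    archive, batch = divmod(batch_global, BATCHES_PER_ARCHIVE)
    if archive >= NUM_ARCHIVES:
        raise ValueError("index position is beyond the ten archives")
    archive_name = f"yfcc100m_dataset-{archive}.tar.gz"
    member_path = f"yfcc100m_dataset-{archive}/batch_{batch}.txt"
    return archive_name, member_path


if __name__ == "__main__":
    print(locate_vgg_batch(0))
    print(locate_vgg_batch(12_345_678))
```

If the guessed file does not contain the image, the neighboring batch files are the natural next places to look.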

File Layout: Each VGG feature file contains the data for around ten thousand images, one image per row, following the ordering in the YFCC100M. Each row begins with the MD5-hashed identifier for the image, followed by vgg and the number of dimensions, then a series of values representing the CNN output. The identifying elements are tab-separated, and the values are space-separated. There are no column headers.
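A row in this layout might be parsed as in the sketch below. It assumes that vgg and the number of dimensions appear as two separate fields, and it splits on any whitespace so that it does not depend on exactly where tabs and spaces fall. The function name is hypothetical.

```python
def parse_vgg_row(line):
    """Parse 'MD5 vgg DIM v1 v2 ...' into (md5_id, label, values).

    Assumes the label and the dimension count are separate
    whitespace-delimited fields.
    """
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError("row is too short to hold an identifier, label, and size")
    md5_id, label, dim_text = tokens[:3]
    dim = int(dim_text)
    values = [float(token) for token in tokens[3:]]
    if len(values) != dim:
        raise ValueError(f"{md5_id}: expected {dim} values, found {len(values)}")
    return md5_id, label, values
```

Checking the value count against the declared number of dimensions gives an early warning if a row is truncated or if the layout assumption is wrong.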

Tools and Parameters: The “VGG” convolutional neural network features are those proposed by the Visual Geometry Group (Simonyan and Zisserman). They are extracted with a 16-layer CNN; we use the output of fc7, the last fully connected layer before the classification layer, which has 4,096 dimensions.

Coming Soon: Because the full VGG-CNN features are large and computationally costly to work with, we are also computing versions of the VGG features compressed with PCA (principal component analysis) and will release them soon.


ISTI Deep Feature Corpus

The Deep Feature Corpus currently consists of Hybrid-CNN features computed on each of the static images in the YFCC100M.

General Information About the ISTI Deep Feature Corpus

Caveat — Missing Features: Because some of the media files were removed by their owners between the initial publication of the YFCC100M and when the image files were downloaded for the Deep Features project, a very small percentage of the images listed in the YFCC100M metadata index do not have features in the Deep Features Corpus. The Multimedia Commons images for which there are Deep Features (99,158,687) are a superset of those currently available on AWS.

Public-Domain License: The ISTI Deep Feature Corpus feature sets are licensed under Creative Commons 0, meaning they are in the public domain and free for any use. (Use of the original YFCC100M metadata and media is subject to the Creative Commons licenses chosen by the uploaders.) However, we do appreciate credit as indicated.

Cheers To: The ISTI Deep Feature Corpus is an effort of the Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” within the National Research Council of Italy (CNR). The features are wrangled by Fabrizio Falchi, Giuseppe Amato, Claudio Gennaro, and Fausto Rabitti in the ISTI Multimedia Information Retrieval Lab.

Funding: The ISTI Deep Feature Corpus is funded by a grant from the European Commission for EAGLE, the Europeana network of Ancient Greek and Latin Epigraphy (Grant Agreement #325122), and a grant from the Region of Tuscany for “Smart News: Social Sensing for Breaking News” (Grant CUP CIPE D58C15000270008). (Any opinions, findings, and conclusions expressed on this website are those of the individual researchers and do not necessarily reflect the views of the funders.)

Contact: Questions may be directed to Fabrizio Falchi at fabrizio[stop]falchi[chez]isti[stop]cnr[stop]it.

More Information/Mirror: See the Deep Features website for more information about the ISTI Deep Feature Corpus and instructions on how to get it via FTP.

Hybrid-CNN Features: Static Images

The Hybrid-CNN (convolutional neural network) features may be found in the features/image/hybrid-cnn/ subdirectory within the multimedia-commons S3 data store on AWS Public Data Sets. (For the most part, each type of visual/image feature in the Multimedia Commons has its own subdirectory within features/image/.)

Directory Organization: Within features/image/hybrid-cnn/, there is a set of gzipped text files. (For best results, use gunzip to unzip them.) Each file is named using the first three hexadecimal characters of the MD5-hashed identifiers of the images described within. For example, the file hybrid-cnn/004.txt.gz contains the Hybrid-CNN output for images 00401022f59c75ac6251bf08a4a through 004ff2d14dd0225c465c48b428361c.
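Given this naming scheme, the file holding a particular image can be derived from its MD5-hashed identifier. The following is a minimal sketch with a hypothetical function name; lowercasing the identifier is our assumption, based on the lowercase example above.

```python
import string

HYBRID_CNN_PREFIX = "features/image/hybrid-cnn/"


def hybrid_cnn_file_for(md5_id):
    """Return the path of the gzipped file expected to hold this image's row."""
    # Assumption: file names use lowercase hexadecimal, as in 004.txt.gz.
    prefix = md5_id.strip().lower()[:3]
    if len(prefix) != 3 or any(ch not in string.hexdigits for ch in prefix):
        raise ValueError(f"not an MD5-style identifier: {md5_id!r}")
    return f"{HYBRID_CNN_PREFIX}{prefix}.txt.gz"


if __name__ == "__main__":
    print(hybrid_cnn_file_for("004ff2d14dd0225c465c48b428361c"))
```

The path is relative to the root of the data store. The row for the image still has to be found by scanning the file, since the file name only narrows down where to look.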

File Layout: Images are listed in random order within each file. Each row within a file begins with the original media ID, followed by the MD5-hashed image identifier, then 4096 float values representing the activations. Fields within each row are tab-delimited. There are no column headers.
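A row in this layout could be read as sketched below. The function name and the expected_dim parameter are hypothetical; the parser splits on any whitespace, which covers tab-separated fields without depending on the exact delimiter.

```python
def parse_hybrid_cnn_row(line, expected_dim=4096):
    """Parse 'MEDIA_ID MD5 a1 a2 ...' into (media_id, md5_id, activations)."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("row is missing the media ID or the MD5 identifier")
    media_id, md5_id = tokens[0], tokens[1]
    activations = [float(token) for token in tokens[2:]]
    if len(activations) != expected_dim:
        raise ValueError(
            f"{md5_id}: expected {expected_dim} values, found {len(activations)}"
        )
    return media_id, md5_id, activations
```

Because rows within a file are in random order, looking up one image means scanning the file and comparing either identifier, for example by stopping at the first row whose md5_id matches.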

Tools and Parameters: The features were extracted using the Caffe framework. This feature set is based on the activation (after the ReLU) of the neurons in the fc6 layer of the Places Hybrid-CNN, whose model and weights are publicly available in the Caffe Model Zoo. The architecture is the same as for the Caffe reference network.


Features and Annotation Sets Hosted on Other Sites

  • A team at Carnegie Mellon University has released visual and motion features and annotations for all of the videos in the YFCC100M, including dense trajectories, convolutional neural network (CNN) features, and semantic-concept annotations in several domains. Get it here.