YLI Feature Corpus

YLI Corpus Logo

The YLI Feature Corpus

The YLI Feature Corpus includes audio, visual, and motion features commonly used in multimedia analysis and research, computed for each of the images and videos in the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The goal of the YLI Feature Corpus is to let multimedia researchers using the YFCC100M focus on advancing science, rather than spending time and compute power extracting the same features over and over. The features are identified using MD5 hashes that can be used to link them with the original media and with the YFCC100M metadata.

This page provides details about how the features were computed and how they are organized on the Amazon Web Services (AWS) Public Data Sets platform.

Jump to:
Overview of the YLI Feature Corpus
Audio Features for Videos: MFCCs, Kaldi, SAcC
Image Features for Static Images: LIRE, Gist, SIFT
Keyframe Image Features for Videos: LIRE, Gist, SIFT, CaffeNet

This page contains information about features in the YLI Corpus ONLY! For additional feature sets, see the Other Feature Corpora page.

Sample image: Stars over snow

Overview of the YLI Feature Corpus

The YLI Feature Corpus is a work in progress; new features become available on a regular basis. Features currently available for (almost) all of the images and videos in the YFCC100M include:

  • Visual features computed for static images: LIRE features (auto color correlogram, basic features, CEDD, color layout, edge histogram, fuzzy color and texture histogram (FCTH), fuzzy opponent histogram, Gabor features, joint histogram, joint opponent histogram, scalable color, simple color histogram (RGB), and Tamura texture), Gist features, and SIFT features (scale-invariant feature transform)
  • Visual features computed for keyframes from videos: LIRE features
  • Audio features computed for audio tracks from videos: MFCC20s (mel frequency cepstral coefficients), Kaldi pitch, SAcC pitch (subband autocorrelation classification)

We are currently in the process of adding Gist and SIFT features for video keyframes. Other features under consideration include CaffeNet/AlexNet features for images and keyframes, and dense-trajectory and optical-flow motion features for videos.

Important Caveat: The feature sets are not complete at this time, as we were initially unable to download some of the images and videos. In addition, some images were downloaded incorrectly, resulting in invalid features. However, we have since obtained most of the remaining media files and are working on computing or recomputing features for them. (Because of the timing of the metadata and media-file collection processes, a very small percentage of the original images and videos are no longer available; we will not be able to provide features for those items.) Details about what’s missing for each type of feature are given below.

Full Corpus vs. Subsets: For researchers who only want to work with specific annotated subcorpora (YLI-MED or YLI-GEO), we are releasing separate feature bundles for just those subcorpora. The explanations below about how features were computed and organized apply to the subcorpus feature bundles as well as to the YLI Corpus as a whole, which includes features for (almost) all of the images and videos in YFCC100M.

Public-Domain License: The feature sets in the YLI Corpus are licensed under Creative Commons 0, meaning they are in the public domain and free for any use. (Use of the original YFCC100M metadata and media is subject to the Creative Commons licenses chosen by the uploaders.) However, we do appreciate credit as indicated.

Documentation/Preferred Citation: Julia Bernd, Damian Borth, Carmen Carrano, Jaeyoung Choi, Benjamin Elizalde, Gerald Friedland, Luke Gottlieb, Karl Ni, Roger Pearce, Doug Poland, Khalid Ashraf, David A. Shamma, and Bart Thomee. 2015. Kickstarting the Commons: The YFCC100M and the YLI Corpora. In Proceedings of the ACM Multimedia 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons ’15). PDF.

Cheers to: The feature corpus is a collaborative effort between the International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory, who are working together to process and analyze the user-generated content in YFCC100M. The YLI features are wrangled by Damian Borth, Carmen Carrano, Jaeyoung Choi, Benjamin Elizalde, Luke Gottlieb, Karl Ni, and Roger Pearce.

Funding: The YLI feature corpus is funded by a collaborative LDRD project led by LLNL under the U.S. Dept. of Energy (contract DE-AC52-07NA27344), a National Science Foundation grant for the SMASH project (grant IIS-1251276), and a FITweltweit fellowship from the German Academic Exchange Service (DAAD). (Any opinions, findings, and conclusions expressed on this website are those of the individual researchers and do not necessarily reflect the views of the funders.)

Contact: Questions may be directed to Jaeyoung Choi at jaeyoung[chez]icsi[stop]berkeley[stop]edu.


Audio Features for Videos

Each type of audio feature in the YLI corpus has its own subdirectory within the features/audio/ directory on the multimedia-commons S3 data store on AWS Public Data Sets. The same structure may be found within the subsets/YLI-GEO/features/audio/ and subsets/YLI-MED/features/audio/ directories as well.

Interval: The frame size for all audio features is 25 ms and the step size is 10 ms.
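As a rough sketch of what this windowing implies, assuming standard sliding-window framing (the helper name is ours, not part of the corpus tooling), the number of frames for a clip of a given duration can be estimated as follows:

```python
def n_frames(duration_s, frame_ms=25, step_ms=10):
    """Approximate number of analysis frames for a clip, given the
    25 ms window and 10 ms hop used for the YLI audio features."""
    usable_ms = duration_s * 1000 - frame_ms
    if usable_ms < 0:
        return 0  # clip is shorter than one analysis window
    return 1 + int(usable_ms // step_ms)
```

For example, a one-second clip yields roughly 98 frames under these parameters.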

Missing Audio Features: We do not yet have complete sets of audio features for all of the videos, either in the full YLI Corpus or in the YLI-GEO or YLI-MED subsets. In some cases, we were initially unable to download the video or to extract the audio track. In other cases, the videos simply have no audio track in the first place. We are working on getting (most of) the remainder of the video files, extracting audio tracks, and computing features for them. When that process is complete, we will also make an index of videos without audio tracks.

Mel Frequency Cepstral Coefficients (MFCCs)

Directory Organization: The MFCC20 files are provided as a gzipped tarball, mfcc20.tgz. (For best results, use tar -xzf to unzip and untar the archive.) Within the archive, you will find a set of directories and subdirectories based on the first few digits in the MD5-hashed video identifiers. MFCC20s for each video are contained in a separate bzip2ed ASCII file, with a file name based on the full MD5-hashed video identifier. So, for example, the MFCC20 data for the video with identifier 66c4c1b3525aed8d6228854cc53d196 will be found in mfcc20/66c/4c1/66c4c1b3525aed8d6228854cc53d196.mfcc20.ascii.bz2.
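The first-3-digits/next-3-digits directory scheme described above can be reproduced with a small helper like the following (a sketch; the function name is ours):

```python
def mfcc20_path(video_md5):
    """Build the archive-relative path for a video's MFCC20 file,
    following the first-3 / next-3 digit directory scheme."""
    return (f"mfcc20/{video_md5[:3]}/{video_md5[3:6]}/"
            f"{video_md5}.mfcc20.ascii.bz2")
```

The Kaldi and SAcC archives use the same layout, with `kaldi_pitch` or `sacc_pitch` substituted for `mfcc20` in both the directory names and the file suffix.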

File Layout: There are no headers or frame numbers within the files.
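Since each per-video file is bzip2ed ASCII with no headers, it can be read with Python's standard library alone. The sketch below assumes one whitespace-separated row of coefficients per frame, which matches the layout described above but has not been verified against every file:

```python
import bz2

def read_ascii_features(path):
    """Read a bzip2ed ASCII feature file into a list of per-frame
    coefficient lists (assumes whitespace-separated floats,
    one frame per line, no headers)."""
    with bz2.open(path, "rt") as f:
        return [[float(x) for x in line.split()]
                for line in f if line.strip()]
```

The same reader applies to the Kaldi and SAcC pitch files, which share this format.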

Tools and Parameters: Computed using feacalc (part of SPRACHcore). We extracted the nineteen lowest channels, plus energy (MFCC20s). The specific scripts we used may be found on AWS in the tools/etc/scripts/mfcc directory.

Status:

  • Full Corpus: MFCC20s have been computed and are available for (almost) all of the videos in the YFCC100M dataset that have audio tracks.
  • YLI-GEO: MFCC20s are available in a separate bundle for (most of) the videos in the YLI-GEO subcorpus with audio tracks.
  • YLI-MED: MFCC20s are available in a separate bundle for (most of) the positive-example videos in the YLI-MED subcorpus.

Kaldi Pitch Tracker

Directory Organization: The Kaldi files are provided as a gzipped tarball, kaldi_pitch.tgz. (For best results, use tar -xzf to unzip and untar the archive.) Within the archive, you will find a set of directories and subdirectories based on the first few digits in the MD5-hashed video identifiers. Kaldi pitch data for each video is contained in a separate bzip2ed ASCII file, with a file name based on the full MD5-hashed video identifier. So, for example, the Kaldi pitches for the video with identifier 66c4c1b3525aed8d6228854cc53d196 will be found in kaldi_pitch/66c/4c1/66c4c1b3525aed8d6228854cc53d196.kaldi_pitch.ascii.bz2.

File Layout: There are no headers or frame numbers within the files.

Tools and Parameters: Computed using the Kaldi Toolkit. The specific scripts we used may be found on AWS in the tools/etc/scripts/kaldi directory.

Status:

  • Full Corpus: Kaldi pitch features have been computed and are available for (almost) all of the videos in the YFCC100M dataset that have audio tracks.
  • YLI-GEO: Kaldi features are available in a separate bundle for (most of) the videos in the YLI-GEO subcorpus with audio tracks.
  • YLI-MED: Kaldi features are available in a separate bundle for (most of) the positive-example videos in the YLI-MED subcorpus.

SAcC Pitch Tracker (Subband Autocorrelation Classification)

Directory Organization: The SAcC pitch files are provided as a gzipped tarball, sacc_pitch.tgz. (For best results, use tar -xzf to unzip and untar the archive.) Within the archive, you will find a set of directories and subdirectories based on the first few digits in the MD5-hashed video identifiers. SAcC pitches for each video are contained in a separate bzip2ed ASCII file, with a file name based on the full MD5-hashed video identifier. So, for example, the SAcC pitch data for the video with identifier 66c4c1b3525aed8d6228854cc53d196 will be found in sacc_pitch/66c/4c1/66c4c1b3525aed8d6228854cc53d196.sacc_pitch.ascii.bz2.

File Layout: There are no headers or frame numbers within the files.

Tools and Parameters: Computed using the SAcC package. The specific script we used may be found on AWS in the tools/etc/scripts/SAcC directory.

Status:

  • Full Corpus: SAcC pitch features have been computed and are available for (almost) all of the videos in the YFCC100M dataset that have audio tracks.
  • YLI-GEO: SAcC pitches are available in a separate bundle for (most of) the videos in the YLI-GEO subcorpus with audio tracks.
  • YLI-MED: SAcC pitches are available in a separate bundle for (most of) the positive-example videos in the YLI-MED subcorpus.

Image Features for Static Images

Each type of image feature (visual feature) in the YLI Corpus has its own subdirectory within the features/image/ directory on the multimedia-commons S3 data store on AWS Public Data Sets. The same structure may be found within the subsets/YLI-GEO/features/image/ directory as well.

Missing or Invalid Visual Features on Static Images: We do not yet have complete sets of valid image features for all of the static images, either in the full YLI Corpus or in the YLI-GEO subset. In some cases, where we were initially unable to download an image, we have no features for it at all. Unfortunately, in other cases, our downloads caught placeholder images containing error messages. These placeholder images were included in the feature-computation process, so there are a number of feature sets that pertain to the placeholder/error image rather than the actual image uploaded by the user. We are working on getting (most of) the remainder of the images and computing or re-computing features for them. In the meantime, we will soon be cleaning up the existing feature sets and removing the invalid features.

LIRE Features: Static Images

Directory Organization: Each specific feature type has a subdirectory under features/image/. (There is no LIRE directory per se.) The subdirectories are:

  • Simple color histogram (RGB): RGB/
  • Auto color correlogram: acc/
  • Basic features: bf/
  • Color and edge directivity descriptor: cedd/
  • Color layout: col/
  • Edge histogram: edgehistogram/
  • Fuzzy color and texture histogram: fcth/
  • Gabor features: gabor/
  • Joint histogram: jhist/
  • Joint opponent histogram: jophist/
  • Fuzzy opponent histogram: ophist/
  • Scalable color: scalablecolor/
  • Tamura texture: tamura/

Within each of the LIRE feature directories, there is a set of text files, each named using the first three digits of the MD5-hashed identifiers of the images described within. For example, the file col/004.col contains the color-layout data for images 00401022f59c75ac6251bf08a4a through 004ff2d14dd0225c465c48b428361c.

File Layout: Each row within a file begins with the MD5-hashed image identifier, the abbreviated feature name, and the number of dimensions, followed by the feature data. Rows end with newline (\n). There are no column headers.
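Given that layout, one LIRE row can be split into its identifier, feature name, and feature vector with a few lines of Python. The sketch below assumes the fields are whitespace-separated (an assumption; the exact delimiter is not stated above):

```python
def parse_lire_row(row):
    """Split one LIRE feature row into (image_md5, feature_name, values).
    Assumes whitespace-separated fields: MD5 id, abbreviated feature
    name, dimension count, then the feature vector itself."""
    fields = row.split()
    md5, name, ndims = fields[0], fields[1], int(fields[2])
    values = [float(x) for x in fields[3:3 + ndims]]
    return md5, name, values
```

The same row format applies to the Gist files for static images, so the parser works for both.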

Tools and Parameters: Computed using the LIRE (Lucene Image Retrieval) package. The specific script we used is available on AWS in tools/etc/scripts/Lire-0.9.5.zip.

Status:

  • Full Corpus: LIRE features have been computed and are available for (almost) all of the static images in the YFCC100M dataset.
  • YLI-GEO: LIRE features are available in a separate bundle for (most of) the static images in the YLI-GEO subcorpus.

Gist Features: Static Images

Directory Organization: The features/image/gist/ directory contains a set of text files, each named using the first three digits of the MD5-hashed identifiers of the images described within. For example, the file gist/004.gist contains the Gist descriptors for images 00401022f59c75ac6251bf08a4a through 004ff2d14dd0225c465c48b428361c.

File Layout: Each row within a file begins with the MD5-hashed image identifier, the abbreviated feature name, and the number of dimensions, followed by the feature data. Rows end with newline (\n). There are no column headers.

Tools and Parameters: Computed using Lear’s implementation. Images were resized to 240 pixels (long dimension) for Gist extraction. The specific scripts we used are available on AWS as a tar archive in tools/etc/scripts/lear_gist-1.2.tgz.

Status:

  • Full Corpus: Gist features have been computed and are available for (almost) all of the static images in the YFCC100M dataset.
  • YLI-GEO: Gist features are available in a separate bundle for (most of) the static images in the YLI-GEO subcorpus.

SIFT Features (Scale-Invariant Feature Transform): Static Images

Directory Organization: The features/image/sift/ directory contains a set of gzipped text files, each named using the first three digits of the MD5-hashed identifiers of the images described within. For example, the file sift/004.sift.gz contains the SIFT descriptors for images 00401022f59c75ac6251bf08a4a through 004ff2d14dd0225c465c48b428361c.

File Layout: Each row within a file begins with the MD5-hashed image identifier, followed by: image height (rows), image width (columns), number of keypoints detected (default to max 500), x1, y1 (coordinates of keypoint 1), keypoint 1 size, keypoint 1 strength, keypoint 1 angle, keypoint 1 octave, [128-dimensional descriptor for keypoint 1 with no commas], x2, y2 (coordinates of keypoint 2), keypoint 2 size, keypoint 2 strength, keypoint 2 angle, keypoint 2 octave, [128-dimensional descriptor for keypoint 2 with no commas], and so forth. Rows end with newline (\n). There are no column headers.
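Because each row packs a variable number of keypoints, parsing takes a little bookkeeping: after the three header fields, each keypoint occupies 6 scalars plus its 128-dimensional descriptor. The sketch below assumes all fields are whitespace-separated (the layout above only says the descriptors contain no commas):

```python
def parse_sift_row(row):
    """Parse one SIFT feature row into (image_md5, (height, width), keypoints).
    Assumes whitespace-separated fields: id, height, width, n_keypoints,
    then per keypoint: x, y, size, strength, angle, octave, and a
    128-dimensional descriptor."""
    f = row.split()
    md5 = f[0]
    height, width, n_kp = int(f[1]), int(f[2]), int(f[3])
    keypoints, pos = [], 4
    for _ in range(n_kp):
        x, y, size, strength, angle, octave = (float(v) for v in f[pos:pos + 6])
        descriptor = [float(v) for v in f[pos + 6:pos + 134]]
        keypoints.append({"x": x, "y": y, "size": size, "strength": strength,
                          "angle": angle, "octave": octave,
                          "descriptor": descriptor})
        pos += 134  # 6 scalars + 128 descriptor values per keypoint
    return md5, (height, width), keypoints
```

The keyframe SIFT files (described below) prepend a keyframe filename instead of an image MD5, but otherwise follow the same per-keypoint layout.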

Tools and Parameters: Computed using OpenCV. Files were not resized prior to computing SIFT features. (In other words, the image height and width in the SIFT feature file may not match the height and width of the image file in the data/images/ directory.) The specific script we used is available on AWS in tools/etc/scripts/extract_sift.py.

Status:

  • Full Corpus: SIFT features have been computed and are available for (almost) all of the static images in the YFCC100M dataset.
  • YLI-GEO: SIFT features are available in a separate bundle for (most of) the static images in the YLI-GEO subcorpus.

Additional Image Features

Not seeing what you’re looking for in the YLI? Check out the Other Feature Corpora page for information about the Hybrid-CNN, SURF+VLAD, and VGG-CNN features that are also hosted in the Multimedia Commons data store, as well as links to feature sets hosted on other sites.


Keyframe Image Features for Videos

Each type of image feature (visual feature) computed on keyframes extracted from the videos in the YLI Corpus has its own folder within the features/keyframe/ directory on the multimedia-commons S3 data store on AWS Public Data Sets. The same structure may be found within the subsets/YLI-GEO/features/keyframe/ and subsets/YLI-MED/features/keyframe/ directories as well.

Interval: Keyframes for all of the videos were extracted at a rate of one frame per second.

Missing Visual Features on Keyframes: We do not yet have complete sets of image features on keyframes for all of the videos, either in the full YLI Corpus or in the YLI-GEO or YLI-MED subsets; in some cases, we were initially unable to download the video or to extract the keyframes. We are working on getting (most of) the remainder of the video files, extracting keyframes, and computing features for them.

LIRE Features: Video Keyframes

Directory Organization: Each specific feature type has a subdirectory under features/keyframe/. (There is no LIRE directory per se.) The features included and subdirectory names parallel those for the static images, described above.

Within each of the LIRE feature directories for keyframes, there is a set of text files, each named using the first three digits of the MD5-hashed identifiers of the videos described within. For example, the file tamura/01a.tamura contains the Tamura-texture data for the keyframes from videos 01a107a8832b9aad4df749d16faaea2 through 01aeb60fca4753f40cbd3051f685.

File Layout: Each row within a feature file begins with the filename for the keyframe, based on its MD5-hashed video identifier plus the keyframe number, then the abbreviated feature name, then the number of dimensions, followed by the feature data. Rows end with newline (\n). There are no column headers.

Tools: See information for static images.

Status:

  • Full Corpus: LIRE features have been computed and are available for keyframes from (almost) all of the videos in the YFCC100M dataset.
  • YLI-GEO: LIRE features on keyframes are available in a separate bundle for (most of) the videos in the YLI-GEO subcorpus.
  • YLI-MED: LIRE features on keyframes are available in a separate bundle for (most of) the positive-example videos in the YLI-MED subcorpus.

Gist Features: Video Keyframes

Directory Organization: The features/keyframe/gist directory contains a set of text files, each named using the first three digits of the MD5-hashed identifiers of the videos described within. For example, the file gist/01a.gist contains the Gist descriptors for the keyframes from videos 01a107a8832b9aad4df749d16faaea2 through 01aeb60fca4753f40cbd3051f685.

File Layout: Each row within a feature file begins with the filename for the keyframe, based on its MD5-hashed video identifier plus the keyframe number, then the feature name, then the number of dimensions, followed by the feature data. Rows end with newline (\n). There are no column headers.

Tools: See information for static images.

Status:

  • YLI-GEO: Gist features on keyframes are available only for (most of) the videos in the YLI-GEO subcorpus.

SIFT Features (Scale-Invariant Feature Transform): Video Keyframes

Directory Organization: The features/keyframe/sift directory contains a set of gzipped text files, each named using the first three digits of the MD5-hashed identifiers of the videos described within. For example, the file sift/01a.sift.gz contains the SIFT descriptors for the keyframes from videos 01a107a8832b9aad4df749d16faaea2 through 01aeb60fca4753f40cbd3051f685.

File Layout: Each row within a feature file begins with the filename for the keyframe, based on its MD5-hashed video identifier plus the keyframe number. The remainder of the row follows the format described above for static images. There are no column headers.

Tools: See information for static images.

Status:

  • YLI-GEO: SIFT features on keyframes are available only for (most of) the videos in the YLI-GEO subcorpus.

CaffeNet/AlexNet Features: Video Keyframes

Directory Organization: The features/keyframe/alexnet/ directory contains three gzipped tar archives, each containing the data for one type of extracted feature:

  • keyframe/alexnet/fc6.tgz: Layer 6 activations.
  • keyframe/alexnet/fc7.tgz: Layer 7 activations.
  • keyframe/alexnet/softmax.tgz: Posterior probabilities.

(For best results, use tar -xzf to unzip and untar the archives.) Within each archive directory, you will find a set of text files, each of which is named using the MD5-hashed identifier of the video described within, along with the abbreviated feature name. For example, the file alexnet/fc6/1039a3504a1b69508b718e46e44f68.fc6.txt contains the Layer 6 CaffeNet output for the keyframes from video 1039a3504a1b69508b718e46e44f68.

File Layout: Each row within a feature file contains the data for one keyframe. There are no column or row headers.
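Since there is one row per keyframe and keyframes are sampled at one frame per second, rows can be paired with approximate timestamps. A minimal sketch, assuming whitespace-separated values per row (both the function name and the delimiter are our assumptions):

```python
def load_keyframe_features(path):
    """Load a per-video CaffeNet feature text file into
    (timestamp_s, vector) pairs. Assumes one keyframe per row of
    whitespace-separated floats; with keyframes extracted at 1 fps,
    row i corresponds to roughly second i of the video."""
    with open(path) as f:
        rows = [[float(x) for x in line.split()]
                for line in f if line.strip()]
    return [(i, vec) for i, vec in enumerate(rows)]
```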

Tools and Parameters: Extracted using the Caffe utility. The pretrained CaffeNet model is a slight variant on AlexNet. We extracted Layer 6, Layer 7, and posterior probabilities.

Status:

  • YLI-MED: CaffeNet/AlexNet features on keyframes are available only for (most of) the positive-example videos in the YLI-MED subcorpus.