The YFCC100M is the largest publicly and freely useable multimedia collection, containing the metadata of around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses.
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already; once logged in, submitting the request for the YFCC100M is straightforward. Webscope will ask what you plan to do with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty member at an accredited university, so you will be automatically approved. An email with access instructions should reach you within the hour. These emails are occasionally marked as spam, so if you receive nothing, please email them directly.
Accessing the YFCC100M: The dataset is hosted by Amazon in an S3 data bucket. It is around 15 GB in size and is stored as a single Bzip2-compressed file named yfcc100m_dataset.bz2. At the moment you will need an AWS account to download the file from the bucket, although Webscope is working on a solution so you can get the dataset without one. This matters because creating an AWS account requires a credit card (even though it will never be charged when you just download data from a bucket), and in many countries it is not common to have such a card.
Using the YFCC100M: Once you have downloaded the dataset, you can directly read its contents using command line tools such as bzcat. In many situations, however, it would be much easier and faster to launch an EC2 instance and install a Hadoop cluster to efficiently access and process the dataset directly from the S3 bucket. To avoid the hassle of installing and maintaining Hadoop yourself, you can also launch an EMR cluster. Naturally, using an EC2 instance or an EMR cluster will not be free, but you will get convenience in return.
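If you would rather read the downloaded file from a script than from the command line, the following is a minimal sketch in Python. It assumes the decompressed file is UTF-8 text with one record per line and tab-separated fields; that layout is not described here, so check it against the access instructions you receive. The function names iter_records and peek are made up for this example.

```python
import bz2
from itertools import islice
from typing import Iterator, List


def iter_records(path: str) -> Iterator[List[str]]:
    """Yield one record at a time from a Bzip2-compressed text file.

    Assumption: one record per line, fields separated by tabs.
    """
    # bz2.open in text mode decompresses incrementally, so the
    # archive is never loaded into memory as a whole.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def peek(path: str, count: int = 5) -> List[List[str]]:
    """Return the first `count` records without reading the rest."""
    return list(islice(iter_records(path), count))
```

Calling peek("yfcc100m_dataset.bz2", 3) would return the first three records as lists of strings, which is a quick way to inspect the file before committing to a full pass over it.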
License: Use of the dataset is subject to the relevant Webscope License Agreement, which you need to agree to before being granted access to the dataset.
Citation: If you decide to use the dataset, please cite the following article that describes it in detail: B. Thomee, D.A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li, “YFCC100M: The New Data in Multimedia Research”, Communications of the ACM, 59(2), 2016, pp. 64-73. The article is open access, so you can get it free of charge here.
Acknowledgements: The YFCC100M was produced at Yahoo Labs by Bart Thomee and David Ayman Shamma, with the assistance of Li-Jia Li and Nikhil Rasiwasia, as well as the Yahoo Webscope and Legal teams. An interview with the creators is available on Vimeo.
Contact: If you have questions about the YFCC100M dataset, please email us.
The original images and videos indexed in the YFCC100M may be found in the data/ directory on the multimedia-commons S3 data store on AWS Public Data Sets. This directory has 99,171,688 image files and 787,479 video files. The videos add up to around 8,081 hours, with an average video length of 37s and a median length of 28s.
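The average video length quoted above can be checked from the other two figures. This small sketch uses only the numbers stated in the paragraph; the variable names are ours.

```python
# Figures quoted in the paragraph above.
total_hours = 8_081      # total video duration, in hours (approximate)
video_count = 787_479    # number of video files in data/

# Convert hours to seconds, then divide by the number of videos.
total_seconds = total_hours * 3600
mean_seconds = total_seconds / video_count

print(f"mean video length: {mean_seconds:.1f} s")  # mean video length: 36.9 s
```

The result rounds to the stated 37s. The median (28s) being noticeably lower than the mean suggests that a relatively small number of long videos pulls the average up.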
Directory Organization: The original images indexed in the YFCC100M are in data/images/. The videos are in data/videos/mp4/, and keyframes extracted from the videos (one per second) are in data/videos/keyframes/. Within the directory for each data type, you will find a set of subdirectories named according to the first few digits in the MD5-hashed media identifiers of the files within them. So, for example, the original MPEG-4 file for the video with identifier 01dbe4b8aa4987fadc5ce7099e7 will be found at
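The prefix-based layout described above can be sketched as a small helper. This is an illustration only: the number of directory levels, the number of digits per level, and the file naming (hashed identifier plus extension) are not stated here, so the helper takes them as explicit arguments instead of guessing. The name sharded_path and all of its parameters are hypothetical.

```python
from typing import Sequence


def sharded_path(root: str, hashed_id: str, extension: str,
                 widths: Sequence[int]) -> str:
    """Build root/<prefix 1>/<prefix 2>/.../<hashed_id>.<extension>.

    `widths` gives how many leading characters of the hashed identifier
    each directory level consumes. The right value for this collection
    is an assumption the caller has to supply after checking the bucket.
    """
    if sum(widths) > len(hashed_id):
        raise ValueError("prefix widths exceed identifier length")
    parts = [root.rstrip("/")]
    offset = 0
    for width in widths:
        # Each level is named after the next slice of leading characters.
        parts.append(hashed_id[offset:offset + width])
        offset += width
    parts.append(f"{hashed_id}.{extension}")
    return "/".join(parts)
```

With purely illustrative values, sharded_path("data/images/", "abcdef0123", "jpg", (2, 2)) returns "data/images/ab/cd/abcdef0123.jpg". List a few real keys in the bucket first to find the actual widths and depth.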
Resizing: The maximum size for images in the data/images/ directory is 500 pixels (long dimension), although smaller and larger versions can be downloaded directly from Flickr by constructing the URL based on the media object’s identifier (see here for exact instructions).
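As a rough illustration of what constructing such a URL looks like, here is a sketch that follows one pattern Flickr has documented for static image URLs (server id, photo id, secret, and an optional size suffix). Treat the pattern itself as an assumption and verify it against the linked instructions; the function name and parameter names are ours, and the valid size suffix letters are defined by Flickr, not by this example.

```python
def flickr_image_url(server_id: str, photo_id: str, secret: str,
                     size_suffix: str = "") -> str:
    """Assemble a Flickr static image URL from its components.

    Assumed pattern (check against Flickr's own instructions):
    https://live.staticflickr.com/{server}/{id}_{secret}[_{size}].jpg
    """
    # An empty suffix requests the default rendition; a one-letter
    # suffix selects a smaller or larger one.
    suffix = f"_{size_suffix}" if size_suffix else ""
    return (f"https://live.staticflickr.com/"
            f"{server_id}/{photo_id}_{secret}{suffix}.jpg")
```

For example, flickr_image_url("1234", "5678", "abcdef", "b") yields "https://live.staticflickr.com/1234/5678_abcdef_b.jpg". The values shown are placeholders; the real ones would come from the metadata record of each photo.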
Licenses: Use of the original media files is subject to the Creative Commons licenses chosen by their creators/uploaders. License information for each media file can be found within the YFCC100M metadata.
Missing Media Files: The YFCC100M metadata file includes metadata for 5,957 videos and 34,876 images that are not available in the Multimedia Commons data/ directory on AWS. These videos and images were taken down by their owners between the time the YFCC100M index was compiled and the time the media files were gathered. We hope to release an index of the missing files soon. In addition, there are some videos for which the original MPEG-4 file is in data/videos/mp4/, but for which we have not yet extracted keyframes. The missing keyframes will be added soon.
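The missing-file counts line up with the file counts given for the data/ directory earlier on this page. The sketch below adds them up, assuming that the files present in data/ and the missing files together make up the whole index; the variable names are ours.

```python
# Files present in the data/ directory (from the overview above).
images_available = 99_171_688
videos_available = 787_479

# Files listed in the metadata but absent from data/.
images_missing = 34_876
videos_missing = 5_957

images_indexed = images_available + images_missing  # 99,206,564
videos_indexed = videos_available + videos_missing  # 793,436

print(images_indexed + videos_indexed)  # 100000000
```

The totals come to about 99.2 million photos and 0.8 million videos, or 100 million media objects in all.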
Preferred Citation: Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The New Data in Multimedia Research. Communications of the ACM 59(2), pp. 64-73. Available online.
Cheers to: Jaeyoung Choi, Carmen Carrano, Karl Ni, David A. Shamma, Bart Thomee, the Yahoo Webscope and Legal teams — and the millions of Flickr users who contributed to the dataset!
Contact: If you have questions about the downloaded photos and videos, please email us.