A Yahoo-Flickr Grand Challenge on Tag and Caption Prediction!

The challenge has completed – thank you all for your participation!

Members of the Flickr community manually annotate photos with the goal of making them searchable and discoverable. With the advent of camera phones and auto-uploaders, photo uploads have become more numerous and asynchronous. Yet manual annotation is cumbersome for most users, so many photos remain unannotated and difficult to find. Researchers have therefore turned to devising visual concept recognition algorithms to enable automatic annotation of photos.

Progress in visual recognition has been largely driven by training deep neural networks on datasets, such as ImageNet, that were built by manual annotators. However, acquiring such annotations is expensive. In addition, the annotation categories are defined by researchers rather than by users, which means they are not necessarily relevant to users’ interests and cannot be directly leveraged to enable search and discovery.

The Tag and Caption Prediction challenge focuses on how people annotate photos, rather than on how computers annotate photos. Participants are asked to build image analysis systems that think like humans: the correct annotation for an image isn’t necessarily the “true label”. For example, while a photo containing an apple, a banana and a pear could be annotated using these three words, a person would more likely annotate the image with the single word “fruit”.

The results of this challenge will be presented at the ACM International Conference on Multimedia, held from 15-19 October 2016 in Amsterdam, Netherlands.

1. Tasks

To explore the rich nature of human annotation, we support the following two subtasks.

1.1 Tag Prediction

This subtask focuses on predicting the tags (i.e. keywords) that a user annotated a photo with. Since tags are free-form, predicting the correct tags from a virtually endless pool of possible tags is extremely challenging. To this end, this subtask only focuses on a specific subset of tags, namely those that (i) are in the English dictionary, (ii) do not refer to persons, dates, times or places, and (iii) occur reasonably frequently. Specifically, we started out with the 10,000 most frequent user tags matching words in the English dictionary, and discarded all tags that referred to persons, dates, times or places; were different tenses/plurals of the same word; were otherwise obscure or ambiguous; or occurred fewer than 50 times in the test set. This yielded a total of 1,540 tags, which can be found here.
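
To make the selection procedure concrete, the sketch below (in Python) walks through the filtering steps. The helper predicates and the lemmatizer passed in as arguments are hypothetical stand-ins for the manual curation described above, not released tools.

    def build_tag_vocabulary(train_tag_counts, test_tag_counts,
                             in_english_dictionary,
                             refers_to_person_date_time_or_place,
                             is_obscure_or_ambiguous,
                             lemmatize,
                             top_n=10000, min_test_count=50):
        """Sketch of the tag selection procedure described above.

        train_tag_counts / test_tag_counts: collections.Counter mapping tag -> frequency.
        The predicate and lemmatizer arguments are hypothetical helpers standing in
        for the manual curation steps.
        """
        # 1. Start from the most frequent user tags that match dictionary words.
        candidates = [tag for tag, _ in train_tag_counts.most_common()
                      if in_english_dictionary(tag)][:top_n]

        vocabulary, seen_lemmas = [], set()
        for tag in candidates:
            # 2. Drop tags referring to persons, dates, times or places.
            if refers_to_person_date_time_or_place(tag):
                continue
            # 3. Drop obscure or ambiguous tags.
            if is_obscure_or_ambiguous(tag):
                continue
            # 4. Keep only one tense/plural form per word.
            lemma = lemmatize(tag)
            if lemma in seen_lemmas:
                continue
            # 5. Require at least 50 occurrences in the test set.
            if test_tag_counts.get(tag, 0) < min_test_count:
                continue
            seen_lemmas.add(lemma)
            vocabulary.append(tag)
        return vocabulary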

The following metrics will be used to evaluate the tag predictions on the test set:

  • Precision@K: proportion of the top K predicted tags that appear in the user tags.
  • Recall@K: proportion of the user tags that appear in the top K predicted tags.
  • Accuracy@K: 1 if at least one of the top K predicted tags is present in the user tags, 0 otherwise.

We will evaluate at K = 1, 3 and 5.
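
For reference, here is a minimal sketch (in Python) of how these metrics can be computed for a single photo; the official evaluation script may differ in details such as normalization and averaging.

    def precision_at_k(predicted, user_tags, k):
        """Fraction of the top-k predicted tags that appear in the user tags."""
        return sum(tag in user_tags for tag in predicted[:k]) / float(k)

    def recall_at_k(predicted, user_tags, k):
        """Fraction of the user tags that appear in the top-k predicted tags."""
        top_k = set(predicted[:k])
        return sum(tag in top_k for tag in user_tags) / float(len(user_tags))

    def accuracy_at_k(predicted, user_tags, k):
        """1 if at least one of the top-k predicted tags is a user tag, else 0."""
        return int(any(tag in user_tags for tag in predicted[:k]))

    # Example: a ranked list of predictions versus the tags a user actually assigned.
    predicted = ["fruit", "food", "apple", "banana", "table"]
    user_tags = {"fruit", "kitchen"}
    for k in (1, 3, 5):
        print(k, precision_at_k(predicted, user_tags, k),
              recall_at_k(predicted, user_tags, k),
              accuracy_at_k(predicted, user_tags, k))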

1.2 Caption Prediction

This subtask focuses on predicting the caption (i.e. title) that a user annotated a photo with. We encourage you to produce image captions that are not only accurate but also engaging; non-informative captions are discouraged. The long-term goal of this subtask is to build machines that can not only understand what is shown in a photo, but can also associate emotions and feelings with it like a human being would. Similar to the tag prediction subtask, we only focus on photos having a caption that contains words in the English dictionary. The list of 179K words in the dictionary we used can be found here.

The following metrics will be used to evaluate the caption predictions on the test set:

  • Automatic evaluation metrics based on BLEU/METEOR scores. These metrics measure how closely the generated captions match the original user captions over the whole test set (a rough offline BLEU sketch is shown after this list).
  • Manual evaluation metrics based on human judgments. A group of human judges will read the predicted captions for a subsample of the test set, and choose the best performing system.
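
If you want a rough offline estimate of the automatic scores before submitting, corpus-level BLEU can be computed with NLTK as sketched below; the example captions are made up, and the official script may tokenize and smooth differently.

    # Rough offline BLEU estimate using NLTK (pip install nltk).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One reference per photo: the user's caption, lowercased and stripped of
    # punctuation as in the training data. The captions below are made up.
    references = [[["sunset", "over", "the", "golden", "gate", "bridge"]],
                  [["my", "cat", "sleeping", "on", "the", "sofa"]]]
    hypotheses = [["sunset", "at", "the", "golden", "gate", "bridge"],
                  ["a", "cat", "on", "a", "sofa"]]

    score = corpus_bleu(references, hypotheses,
                        smoothing_function=SmoothingFunction().method1)
    print("corpus BLEU: %.3f" % score)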

2. Dataset

The challenge will use data exclusively drawn from the Yahoo-Flickr Creative Commons 100M (YFCC100M) dataset. The benefits of this dataset are its sheer volume and the fact that it is freely and legally usable. The metadata, pixel data, and a wide variety of features are stored on Amazon S3, meaning the data can be accessed and processed directly in the cloud; this is of particular importance to potential participants who may not have access to sufficient computational power or disk storage at their own research lab.

To get access to the YFCC100M data, please click here for more details.

We have split the data into training and test sets based on the last digit prior to the @-symbol in the Flickr user identifier (NSID). The motivation for splitting the data this way is to ensure that no user occurs in multiple partitions, thus avoiding dependencies between the splits. Specifically, the whole dataset has been divided into 10 parts by that digit, where split 0 is used as the test set and the others together form the training set.
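
Since the split depends only on the user identifier, it can be reproduced from the metadata alone; the sketch below assumes NSIDs of the usual form, such as '12345678@N00'.

    def split_id(nsid):
        """Return the partition (0-9) for a Flickr user identifier (NSID),
        i.e. the last digit before the '@' in an identifier like '12345678@N00'."""
        return int(nsid.split("@")[0][-1])

    def is_test_user(nsid):
        # Split 0 is the test set; splits 1-9 together form the training set.
        return split_id(nsid) == 0

    print(split_id("12345678@N00"), is_test_user("12345678@N00"))  # -> 8 False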

2.1 Tag prediction: training data
The training data for the tag prediction task can be found here. The training data contains all photos from the YFCC100M that have at least one tag that appeared in our master list of 1,540 tags. Each line contains four tab-separated fields: (1) data marker, which for the training set is always set to 0, (2) photo identifier, (3) user identifier, (4) comma-separated list of one or more tags from the master list. You can look up each photo in the YFCC100M dataset using the identifier in order to obtain more metadata, or to access the features, annotations and pixels that are stored in the MMCommons S3 bucket (see the aforementioned link).
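
A minimal sketch for reading this file, assuming one record per line with the four tab-separated fields described above; the file name is a placeholder for whatever you downloaded.

    def read_tag_training(path):
        """Yield (photo_id, user_id, tags) tuples from the tag prediction training file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                marker, photo_id, user_id, tags = line.rstrip("\n").split("\t")
                assert marker == "0"  # data marker: 0 indicates the training set
                yield photo_id, user_id, tags.split(",")

    # Hypothetical file name; substitute the training file you downloaded.
    # for photo_id, user_id, tags in read_tag_training("tag_prediction_train.tsv"):
    #     ...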

2.2 Tag prediction: test data
The test data for the tag prediction task can be found here. The test data contains only those photos that remained after applying the earlier mentioned selection procedure. It contains the same fields as the training data, except that the first field is always set to 1 and the last field has been removed, since the tags are what you need to predict! Note that each photo has at least 5 tags from the master list.

2.3 Caption prediction: training data
The training data for the caption prediction task can be found here. The training data contains all photos from the YFCC100M whose caption contains at least one word from our English dictionary. Similar to the tag prediction task, each line contains four tab-separated fields: (1) data marker, which for the training set is always set to 0, (2) photo identifier, (3) user identifier, (4) normalized sentence of words. Here, we preprocessed the original caption by stripping out all punctuation characters and lowercasing it for consistency.
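
The normalization essentially amounts to stripping punctuation and lowercasing; the sketch below approximates it, though the exact preprocessing script may differ (e.g. in how it handles Unicode punctuation).

    import string

    # Translation table that deletes all ASCII punctuation characters.
    _strip_punct = str.maketrans("", "", string.punctuation)

    def normalize_caption(caption):
        """Lowercase the caption, strip punctuation and collapse whitespace."""
        return " ".join(caption.translate(_strip_punct).lower().split())

    print(normalize_caption("Sunset over the Golden Gate Bridge!"))
    # -> 'sunset over the golden gate bridge'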

2.4 Caption prediction: test data
The test data for the caption prediction task can be found here. The test data contains only those photos that remained after applying the earlier mentioned selection procedure. It contains the same fields as the training data, except that the first field is always set to 1 and the last field has been removed, since this time it’s the caption that you need to predict. Note that each photo has a caption that contains only words from the dictionary, and at least 5 of them. While the BLEU evaluation script does not care about punctuation and casing, these do matter for the manual evaluation by professional editors, so it is best if your algorithm produces valid English sentences.
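
Since the judges see your captions verbatim, even a trivial post-processing step that restores sentence casing and final punctuation can help; a minimal sketch:

    def prettify_caption(tokens):
        """Turn a list of lowercase tokens into a minimally formatted sentence."""
        if not tokens:
            return ""
        sentence = " ".join(tokens)
        return sentence[0].upper() + sentence[1:] + "."

    print(prettify_caption(["sunset", "over", "the", "golden", "gate", "bridge"]))
    # -> 'Sunset over the golden gate bridge.'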

3. Leaderboard

We have a live leaderboard up and running. Submissions to solve the tag challenge and/or caption challenge can be uploaded to our evaluation system, which will evaluate each submission on a predefined random subset of the test set (approx. 30% for the tag challenge and 5% for the caption challenge). Once the scores have been calculated for a submission, they will be added to the leaderboard. Only a limited number of attempts can be made per day. The leaderboard is meant to give a preliminary understanding of how well an algorithm is likely to perform on the entire test set, and at the same time stimulates competition amongst participants vying for a place at the top of the list.

You can access the leaderboard here.
You can run your own leaderboard using our source code.

4. Submissions

To formally participate in this challenge, you are expected to submit a scientific paper that (i) significantly addresses the challenge, (ii) describes a working, presentable system or demo that uses the grand challenge dataset, and (iii) describes why the system presents a novel and interesting solution.

The paper submissions (max. 4 pages incl. references) should be formatted according to the ACM conference formatting guidelines. The reviewing process is double-blind, so authors should not reveal their identity in the paper. The finalists will be selected by a committee of academic and industry representatives, based on the novelty, presentation and scientific interest of the approaches, as well as on their performance on the task.

5. Important dates

20 July 2016: Paper submissions due.
27 July 2016: Notification of acceptance.
3 August 2016: Camera-ready paper due.
15-19 October 2016: Conference in Amsterdam.

Organizers

Bart Thomee (Google/YouTube, San Bruno, CA, USA)
Pierre Guarrigues (Yahoo/Flickr, San Francisco, CA, USA)
Liangliang Cao (AI For Customer Service, New York City, NY, USA)
David A. Shamma (CWI, Amsterdam, Netherlands)

You can contact the organizers at: tagcaption2016@gmail.com.