A MediaEval Benchmark on Multimedia Location Prediction!

The challenge has completed – thank you all for your participation!

The placing task requires participants to estimate the locations where multimedia items (photos or videos) were captured using an algorithm that solely inspects the content and metadata of these items, and optionally may exploit additional knowledge sources such as gazetteers. The purpose of this challenge is to advance the state of the art in multimedia understanding. As an example, the developed methods could help a rescue team to infer where exactly a family disappeared in a remote area by discovering the locations shown in videos uploaded to a social network before they lost contact.

Like last year, we maintain the focus on human geography in this year’s edition of the task, considering not only geographic coordinates but also geographic places such as neighborhoods and cities. The placing task integrates all aspects of multimedia: text, audio, photo, video, location, time, users and context.

The task is of interest to researchers in the areas of geographic information systems, multimedia information retrieval, computer vision, and social media analysis.

The results of the placing task will be presented at the MediaEval Benchmarking Initiative for Multimedia Evaluation workshop, held on 20-21 October 2016 in Amsterdam, Netherlands.

1. Sub-tasks

The following two sub-tasks are offered:

1.1 Estimation-based placing task

Participants are given a hierarchy of places across the world, ranging from neighborhoods to countries, and are asked to pick the node from the hierarchy in which they most confidently believe the photo or video was taken. While the ground truth locations of the photos and videos will be associated with the most specific nodes (i.e. leaves) in the hierarchy, participants can express reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy; we will then consider the centroid coordinate of the node as the estimate. If the confidence is sufficiently high, participants may of course estimate the geographic coordinate of the photo/video directly instead of choosing a node from the hierarchy.
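As an illustration of backing off to a higher-level node when confidence is low, here is a minimal sketch. The `PlaceNode` class, the `pick_node` function, the confidence scores, the threshold and the example places and centroids are all hypothetical; the task does not prescribe how confidence is computed or how the hierarchy is represented.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PlaceNode:
    """One node of a place hierarchy (hypothetical structure for illustration)."""
    node_id: int
    name: str
    centroid: Tuple[float, float]  # (longitude, latitude), illustrative values
    parent: Optional["PlaceNode"] = None


def pick_node(leaf: PlaceNode, confidence: Dict[int, float], threshold: float) -> PlaceNode:
    """Walk from the leaf towards the root and return the first node whose
    confidence reaches the threshold; the root is returned as a last resort."""
    node = leaf
    while node.parent is not None and confidence.get(node.node_id, 0.0) < threshold:
        node = node.parent
    return node


# Illustrative three-level hierarchy: country -> city -> neighborhood.
country = PlaceNode(1, "Netherlands", (5.29, 52.13))
city = PlaceNode(2, "Amsterdam", (4.90, 52.37), parent=country)
hood = PlaceNode(3, "Jordaan", (4.88, 52.37), parent=city)

# Assumed per-node confidence scores produced by some estimator.
scores = {3: 0.2, 2: 0.7, 1: 0.95}
chosen = pick_node(hood, scores, threshold=0.6)
print(chosen.name, chosen.centroid)
```

The centroid of the chosen node is what would then be compared against the ground truth coordinate.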

1.2 Verification-based placing task

Participants are given a set of photos and videos, each of which has a corresponding node in the hierarchy where it was supposedly captured. The participants are asked to determine whether or not each photo or video was really taken in the place the node corresponds to.

2. Data

The dataset for this year’s task is a subset of the YFCC100M collection. Once you register to participate, you will be sent instructions on how to get started and how to access the data.

2.1 Place set

We have included only those photos and videos whose coordinates could be reverse geocoded to a place, as given by the Places expansion pack that is included when downloading YFCC100M. Photos and videos taken in or above international waters are therefore not considered, since their locations would be challenging to predict accurately anyway.

2.2 Training and test set

The training set for the estimation-based task contains on the order of 5 million photos and 25 thousand videos, while the test set contains about 1.5 million photos and 30 thousand videos. The verification-based task will be formed by a subset of the estimation-based training and test sets. Note that no user appears in both the training set and the test set. To minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos, and no two photos/videos taken by the same user less than 10 minutes apart were included.
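The per-user limits described above could be applied along the following lines. This is only a sketch under assumptions: the `limit_per_user` function and the item tuple layout are hypothetical, and the greedy chronological selection is one plausible reading of the 10-minute rule, not necessarily the procedure the organizers used.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

# Hypothetical item layout: (item id, user id, capture time).
Item = Tuple[str, str, datetime]


def limit_per_user(items: Iterable[Item],
                   max_items: int = 250,
                   min_gap: timedelta = timedelta(minutes=10)) -> List[Item]:
    """Keep at most max_items per user, skipping any item captured less than
    min_gap after the previously kept item of the same user."""
    by_user: Dict[str, List[Item]] = defaultdict(list)
    for item in items:
        by_user[item[1]].append(item)

    kept: List[Item] = []
    for user_items in by_user.values():
        user_items.sort(key=lambda it: it[2])  # chronological order
        last_time = None
        count = 0
        for item in user_items:
            if count >= max_items:
                break
            if last_time is not None and item[2] - last_time < min_gap:
                continue  # too close in time to the previously kept item
            kept.append(item)
            last_time = item[2]
            count += 1
    return kept


t0 = datetime(2016, 5, 9, 12, 0)
photos = [
    ("a", "u1", t0),
    ("b", "u1", t0 + timedelta(minutes=5)),
    ("c", "u1", t0 + timedelta(minutes=10)),
    ("d", "u2", t0 + timedelta(minutes=1)),
]
print([item[0] for item in limit_per_user(photos)])
```

Photos and videos would be filtered separately, with `max_items=250` for photos and `max_items=50` for videos.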

2.3 Features

We have provided several visual, aural, and motion features to the participants so they can focus on solving the task rather than spending time on reinventing the wheel.

3. Evaluation

The evaluation of the runs submitted by participating groups will be similar to last year's, although this time we will evaluate the photos separately from the videos. For the estimation sub-task, we will measure the distances between the predicted and the actual geographic coordinates using Karney’s algorithm. This algorithm models the Earth as an oblate spheroid and therefore produces more accurate distances than methods such as the great-circle distance, which treat the Earth as a sphere. For the verification sub-task, we will measure the classification errors using traditional classification metrics.
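For a quick local sanity check, the simpler great-circle distance mentioned above can be computed with the haversine formula, as sketched below. Note that this is the spherical approximation, not Karney's algorithm used in the official evaluation; the function name and the choice of mean Earth radius are assumptions. An implementation of Karney's algorithm is available in, for example, the geographiclib package.

```python
import math

# Mean Earth radius in kilometres; assumes a perfectly spherical Earth.
EARTH_RADIUS_KM = 6371.0088


def great_circle_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Haversine distance in kilometres between two points given in degrees.

    Arguments are ordered (longitude, latitude) to match the run file format.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Distance between a predicted and an actual coordinate (illustrative values).
error_km = great_circle_km(-60.656923, -32.956634, -60.70, -32.95)
print(f"{error_km:.2f} km")
```

Distances from this sketch will differ slightly from the official ones, since the sphere ignores the flattening of the Earth.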

3.1 Baselines

We will provide several baseline methods (source code + performance evaluation) to the participants so they have a starting point. We will inform all participants when the baselines are released.

3.2 Leaderboard

We have a running leaderboard system, where participants can submit a new run every few hours and view the performance of their algorithm, as well as their standing relative to others, as evaluated on a representative development set (i.e. part of, but not the complete, test set). Participants are not required to submit their runs to the leaderboard, and may hide their identity if they so desire.

You can access the leaderboard here.
You can run your own leaderboard using our source code.

3.3 Run submission

A maximum of five runs can be submitted for the final evaluation, which are as follows:

  1. Required: You may only use the textual metadata from the training set, and no other information.
  2. Required: You may only use the visual and aural data from the training set, and no other information.
  3. Required: You may only use the textual, visual and aural data from the training set, and no other information.
  4. Optional: Anything goes, e.g. you may crawl your own additional web material or use a gazetteer. You may not look up the items contained in the test sets, nor any items taken by the same user within 24 hours before the first and after the last timestamp of a photo sequence in the mobility test set.
  5. Optional: Same as 4.

The data, external or otherwise, must be clearly described in the accompanying working notes each participant also has to submit. If a participant has difficulty fulfilling any of the above required runs, please contact the organizers to discuss a solution.

3.4 Run format

Runs should be submitted in individual files named according to the following format:

me16pt_[subtask]_[modality]_[group]_[run].txt

where [subtask] is the name of the subtask (i.e. ‘estimation’ or ‘verification’), [modality] is the type of media the run focuses on (i.e. ‘photo’ or ‘video’), [group] is the name of the participating group (max. 10 alphanumeric characters), and [run] is the identifier of the run (1-5). For example, we would submit the file me16pt_verification_video_organizers_2.txt for run 2 of the verification-based subtask that focuses on videos.
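A small helper can guard against misnamed files before submission. The sketch below assumes the naming pattern implied by the example file name, i.e. the prefix me16pt followed by subtask, modality, group and run separated by underscores; the function name `run_filename` and the regular expression are illustrative, not part of the official tooling.

```python
import re

# Pattern inferred from the example me16pt_verification_video_organizers_2.txt.
RUN_FILE_PATTERN = re.compile(
    r"me16pt_(estimation|verification)_(photo|video)_([A-Za-z0-9]{1,10})_([1-5])\.txt"
)


def run_filename(subtask: str, modality: str, group: str, run: int) -> str:
    """Build a run file name and raise ValueError if it violates the stated rules."""
    name = f"me16pt_{subtask}_{modality}_{group}_{run}.txt"
    if RUN_FILE_PATTERN.fullmatch(name) is None:
        raise ValueError(f"invalid run file name: {name}")
    return name


print(run_filename("verification", "video", "organizers", 2))
```

Group names longer than 10 characters, non-alphanumeric group names, or run identifiers outside 1-5 are rejected.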

Each estimation run should be submitted in a single text file, with the following tab-separated fields:

[photo/video id] [longitude] [latitude] for coordinate estimates, e.g. 123456789 -60.656923 -32.956634

[photo/video id] [woeid] for place estimates from the YFCC100M Places expansion pack, e.g. 123456789 5555555

Each verification run should be submitted in a single text file, with the following tab-separated fields:

[photo/video id] 0 if you think the provided place was not where it was taken, e.g. 123456789 0

[photo/video id] 1 if you think the provided place was indeed where it was taken, e.g. 123456789 1

For the estimation-based subtask, you may mix both coordinate estimates and place estimates in a single run file, as your algorithm may be more confident in some estimates than in others. For the verification-based subtask, only zeros and ones are allowed.

The estimates for each photo or video can be added to the file in any order you like, but make sure to include each photo or video exactly once and not to leave any of them out. We will notify you if any estimates are missing, so you can attempt to rectify the situation (if you really cannot produce an estimate, we recommend that you predict the most likely location based on the training data).
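Before submitting, an estimation run can be checked locally against these rules. The following is a minimal sketch, assuming that a line with three tab-separated fields is a coordinate estimate and a line with two fields is a place estimate; the function name `parse_estimation_run` and the `expected_ids` argument are hypothetical and not part of any official tooling.

```python
from typing import Dict, Iterable, Set, Tuple, Union

# A coordinate estimate (longitude, latitude) or a place estimate (WOEID).
Estimate = Union[Tuple[float, float], int]


def parse_estimation_run(lines: Iterable[str], expected_ids: Set[str]) -> Dict[str, Estimate]:
    """Parse tab-separated run lines and check that every expected item
    appears exactly once; raises ValueError on any violation."""
    estimates: Dict[str, Estimate] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        item_id = fields[0]
        if item_id in estimates:
            raise ValueError(f"line {number}: duplicate id {item_id}")
        if len(fields) == 3:
            lon, lat = float(fields[1]), float(fields[2])
            if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
                raise ValueError(f"line {number}: coordinate out of range")
            estimates[item_id] = (lon, lat)
        elif len(fields) == 2:
            estimates[item_id] = int(fields[1])
        else:
            raise ValueError(f"line {number}: expected 2 or 3 fields, got {len(fields)}")
    missing = expected_ids - set(estimates)
    if missing:
        raise ValueError(f"{len(missing)} item(s) have no estimate")
    return estimates


run_lines = ["123456789\t-60.656923\t-32.956634\n", "987654321\t5555555\n"]
parsed = parse_estimation_run(run_lines, {"123456789", "987654321"})
print(parsed)
```

The range check also helps catch swapped fields, since the run format lists longitude before latitude.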

If you are able to do so, we recommend you provide us coordinate estimates instead of node estimates, as this will allow us to compare your results against those from last year, since this year’s test set includes last year’s test set.

4. Recommended reading

  1. Hays, J., Efros, A. A. “IM2GPS: Estimating Geographic Information from a Single Image”. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, 2008.
  2. Cao, L., Yu, J., Luo, J., Huang, T. “Enhancing Semantic and Geographic Annotation of Web Images Via Logistic Canonical Correlation Regression”. In Proceedings of the ACM International Conference on Multimedia, 2009, pp. 125-134.
  3. Yin, Z., Cao, L., Han, J., Zhai, C., Huang, T. “Geographical Topic Discovery and Comparison”. In Proceedings of the ACM International Conference on World Wide Web, 2011, pp. 247-256.
  4. Larson, M., Soleymani, M., Serdyukov, P., Rudinac, S., Wartena, C., Murdock, V., Friedland, G., Ordelman, R., Jones, G. J.F. “Automatic Tagging and Geotagging in Video Collections and Communities”. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2011, pp. 51-54.
  5. Luo, J., Joshi, D., Yu, J., Gallagher, A. “Geotagging in Multimedia and Computer Vision – A Survey”. In Springer Multimedia Tools and Applications, Special Issue: Survey Papers in Multimedia by World Experts, 51(1), 2011, pp. 187–211.
  6. Van Laere, O., Schockaert, S., Dhoedt, B. “Georeferencing Flickr resources based on textual meta-data”. In Journal of Information Sciences, 238, 2013, pp. 52-73.
  7. Penatti, O. A. B., Li, L. T., Almeida, J., Torres, R. da S. “A Visual Approach for Video Geocoding using Bag-of-Scenes”. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2012, article 53.
  8. Choi, J., Lei, H., Ekambaram, V., Kelm, P., Gottlieb, L., Sikora, T., Ramchandran, K., Friedland, G. “Human vs. Machine: Establishing a Human Baseline for Multimodal Location Estimation”. In Proceedings of the ACM International Conference on Multimedia, 2013, pp. 866-867.
  9. Kelm, P., Schmiedeke, S., Choi, J., Friedland, G., Ekambaram, V., Ramchandran, K., Sikora, T. “A Novel Fusion Method for Integrating Multiple Modalities and Knowledge for Multimodal Location Estimation”. In Proceedings of the ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, 2013, pp. 7-12.
  10. Trevisiol, M., Jégou, H., Delhumeau, J., Gravier, G. “Retrieving Geo-location of Videos with a Divide & Conquer Hierarchical Multimodal Approach”. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2013.

5. Important dates

9 May 2016: Data (development + test) released.
2 September 2016: Run submission deadline.
16 September 2016: Run results released.
30 September 2016: Working notes paper deadline.
20-21 October 2016: MediaEval workshop.


6. Task organizers

Bart Thomee (Google, San Bruno, CA, USA)
Olivier Van Laere (Blueshift Labs, San Francisco, CA, USA)
Claudia Hauff (TU Delft, Netherlands)
Jaeyoung Choi (ICSI, Berkeley, CA, USA / TU Delft, Netherlands)

You can contact the organizers at: placing2016@gmail.com.