Preparing a Multimedia Commons Expansion Pack

Thank you for contributing to the Multimedia Commons!

To help the process go smoothly, we’ve put together a set of guidelines for preparing your data for distribution via AWS. In addition to reading the guidelines, we suggest you browse the Multimedia Commons data store to see what the existing expansions look like.

Small- to Medium-Sized Expansion Packs

Small- to medium-sized expansions (up to 50GB) can be saved as a single file, using the following specifications:

  1. Each line in the file should represent one photo/video from the YFCC100M dataset.
  2. Each line may have one or more tab-separated fields, where:
    • The first field contains the photo/video md5 hash (use the index yfcc100m_hash.bz2 that accompanies the YFCC100M to convert a photo or video ID to its md5 hash).
    • The remaining fields contain your annotations and/or computed data for that photo/video.
  3. Sort the lines so that the photos/videos appear in the same order as in the original dataset (use yfcc100m_lines.bz2 to look up the line number associated with a photo or video ID).
  4. If you don’t have any data/annotations for a photo/video, still include its md5 hash in the file and leave all the other fields empty, so that there are no missing lines (e.g., a photo on line number 123,456 in the dataset can still be found at line number 123,456 in your expansion pack). If your data/annotations apply only to a particularly small subset of the YFCC100M (for example, if they are relevant only for the videos), please contact us for further guidance.
  5. If your computed data contains freeform text, please URL-encode it (with a “+” for a single space instead of “%20”); this keeps the text readable while preventing spurious tabs and newlines from breaking the file format. If your computed data contains binary data, please Base64-encode it.
  6. If one of your fields contains multiple items, you can comma-separate them (e.g., apple,cat,outdoor). If the items are key-value pairs, you can colon-separate each key from its value (e.g., apple:0.745,cat:0.234,outdoor:0.955).
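
As an illustration of items 2, 4, 5, and 6, here is a minimal Python sketch of how one line of such a file might be assembled. The function names (encode_text, encode_binary, encode_pairs, make_line) are hypothetical and not part of any Multimedia Commons tooling. The sketch also assumes that a photo/video without data should keep the same number of (empty) fields as every other line; the guidelines only say to leave the other fields empty.

```python
import base64
from urllib.parse import quote_plus


def encode_text(text: str) -> str:
    """URL-encode freeform text: spaces become '+', tabs/newlines become %XX."""
    return quote_plus(text)


def encode_binary(blob: bytes) -> str:
    """Base64-encode binary data so it fits on a single text line."""
    return base64.b64encode(blob).decode("ascii")


def encode_pairs(pairs: dict) -> str:
    """Join key-value pairs as key:value items separated by commas."""
    return ",".join(f"{key}:{value}" for key, value in pairs.items())


def make_line(md5_hash: str, fields: list) -> str:
    """Build one tab-separated line: md5 hash first, then the data fields."""
    return "\t".join([md5_hash, *fields])


# Hypothetical example with two fields: a caption and a set of label scores.
# "<md5 hash>" stands in for a real hash looked up in yfcc100m_hash.bz2.
line = make_line(
    "<md5 hash>",
    [encode_text("a cat\toutdoors"), encode_pairs({"apple": 0.745, "cat": 0.234})],
)

# A photo/video with no data still gets a line; its fields are left empty
# (assumption: the field count is kept constant).
empty_line = make_line("<md5 hash>", ["", ""])
```

Because quote_plus also escapes commas and colons, encoded text can be placed inside a comma- or colon-separated field without ambiguity; urllib.parse.unquote_plus reverses the encoding when reading the file back.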

Large Expansion Packs

For larger expansion packs, such as big extracted features, it’s best to save each feature as a separate file. (Smaller expansions may also use this format, where convenient.) The specifications in that case are:

  1. Name each file in the format [md5 hash].[extension], where you choose the extension. (Use the index yfcc100m_hash.bz2 that accompanies the YFCC100M to convert a photo or video ID to its md5 hash.)
  2. Organize the files in a two-level subdirectory structure, rather than placing them all in a single directory: [1st three characters of md5 hash]/[2nd three characters of md5 hash]/[full md5 hash].[extension]. So, for example, the frabjosity matrix for the image with md5 hash 004ff2d14dd0225c465c48b428361c would be found in frab/004/ff2/004ff2d14dd0225c465c48b428361c.frab.
  3. Within each file, fields should be tab-separated.
  4. Use URL encoding for freeform text or Base64 for binary data; comma-separate or colon-separate items within fields. (See items 5 and 6 above.)
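
The directory layout in item 2 can be sketched in Python as follows. The helper name feature_path and its root parameter are hypothetical; the top-level directory (frab in the example above) is inferred from that example and is not otherwise specified in these guidelines. The sketch deliberately does not validate the hash and simply slices whatever string it is given.

```python
from typing import Optional


def feature_path(md5_hash: str, extension: str, root: Optional[str] = None) -> str:
    """Return [root/]<chars 1-3>/<chars 4-6>/<full hash>.<extension>."""
    parts = [md5_hash[:3], md5_hash[3:6], f"{md5_hash}.{extension}"]
    if root:
        # Optional top-level directory, e.g. one per feature type (assumption).
        parts.insert(0, root)
    return "/".join(parts)


# Reproduces the example path given in item 2.
path = feature_path("004ff2d14dd0225c465c48b428361c", "frab", root="frab")
```

Splitting on the first six characters of the hash spreads the files across many small directories, so no single directory has to hold every file in the expansion.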

Documentation

To accompany your expansion pack, we would like:

  • A README that explains what the expansion pack contains, how the data is organized within the files (field names and so on), relevant parameters for feature computation, etc.
  • A description to put on the Multimedia Commons website (including information about the preferred citation). (See the Other Feature Corpora page for examples.)
  • Any accompanying scripts/utilities/tools that are needed to interpret the data.

Contact

Questions may be directed to Jaeyoung Choi at jaeyoung[chez]icsi[stop]berkeley[stop]edu or Bart Thomee at bthomee[chez]yahoo-inc[stop]com.