Getting Started

The Multimedia Commons data and related resources are stored on Amazon S3 (Simple Storage Service), in the multimedia-commons data bucket. This page explains how you can use Amazon Web Services and other tools to access the data, download it, or work with it directly in the cloud.

Jump to:
Browsing the Data
Downloading the Data
Copying the Data
Mounting the Data
Using our CloudFormation Template
Attaching an EBS Volume to an EC2 Instance
Setting Up Solr to Search the Data

Browsing the Data

You can explore the Multimedia Commons data through our data portal.


Downloading the Data

You can download the whole dataset or a portion of it to your local hard drive or to an AWS EBS volume attached to an AWS EC2 instance. If you use an EBS volume, provision it with enough capacity to hold the data you plan to copy.
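
One way to choose a volume size is to ask S3 how large the part you plan to download is. The sketch below assumes the AWS CLI is installed and configured; required_gib is a hypothetical helper name, not part of any tool. It reads the output of aws s3 ls --recursive --summarize, which ends with a "Total Size" line in bytes, and rounds that figure up to whole GiB.

```shell
# Hypothetical helper: read `aws s3 ls --recursive --summarize` output on
# stdin and print the total size rounded up to whole GiB.
required_gib() {
    awk '/Total Size:/ { bytes = $3 }
         END { gib = 1024 * 1024 * 1024
               printf "%d\n", int((bytes + gib - 1) / gib) }'
}

# Usage (needs AWS credentials and network access):
#   aws s3 ls s3://multimedia-commons/feats/audio/mfcc20/ --recursive --summarize | required_gib
```

The result is a lower bound; it is sensible to leave some headroom for filesystem overhead.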

If you just want individual files, you can use our data portal to navigate to each item and save it directly to your computer. For larger batches of files, a specialized utility is usually more practical. Many tools can do this, so try them out and use whichever you feel comfortable with. For example:

  • s3cmd is useful for simple operations such as listing contents, downloading small numbers of files, etc. An example s3cmd command:
    s3cmd ls s3://multimedia-commons/images
  • s3gof3r provides parallelized streaming access to Amazon S3, which is especially useful for downloading large files. An example s3gof3r command:
    gof3r get -b multimedia-commons -k images/ -p ./
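
Amazon's own aws command line utility can also download part of the bucket. The following is a minimal sketch, assuming the AWS CLI is installed and configured; download_prefix and the local directory name are hypothetical.

```shell
# Hypothetical helper: mirror one prefix of the bucket into a local directory.
# `aws s3 sync` only transfers files that are missing or changed locally,
# so an interrupted download can be continued by running it again.
download_prefix() {
    prefix=$1   # e.g. feats/audio/mfcc20
    dest=$2     # local directory, e.g. ./mmc
    mkdir -p "$dest/$prefix" &&
    aws s3 sync "s3://multimedia-commons/$prefix" "$dest/$prefix"
}

# Usage:
#   download_prefix feats/audio/mfcc20 ./mmc
```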

Copying the Data

If you’re already using S3, you can also copy the data to your own S3 bucket.

Step 1: To copy directly from a source bucket (we’ll call it src_s3_bucket) to a destination bucket (we’ll call it dst_s3_bucket), use the cp command of Amazon’s own aws command line utility:

  • aws s3 cp s3://src_s3_bucket/ s3://dst_s3_bucket/ --recursive

The --recursive flag specifies that ALL files must be copied, with the same directory structure as the original.

For example, to copy every MFCC20 audio feature file to your bucket, using the same directory structure, use something like this:

  • aws s3 cp s3://multimedia-commons/feats/audio/mfcc20/ s3://dst_s3_bucket/feats/audio/mfcc20 --recursive
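
Because a recursive copy can touch a very large number of objects, it may help to preview it first. The sketch below relies on the --dryrun option of aws s3 cp, which prints the operations that would be performed without running them. The function name preview_copy is hypothetical, and dst_s3_bucket remains a placeholder for your own bucket.

```shell
# Hypothetical helper: show what a recursive copy of one prefix would do.
preview_copy() {
    prefix=$1       # e.g. feats/audio/mfcc20
    dst_bucket=$2   # name of your own bucket
    aws s3 cp "s3://multimedia-commons/$prefix/" "s3://$dst_bucket/$prefix/" \
        --recursive --dryrun
}

# Usage:
#   preview_copy feats/audio/mfcc20 dst_s3_bucket
```

Once the preview looks right, the same command without --dryrun performs the copy.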

Step 2: You can check the contents of the destination bucket using one of the following commands:

  • aws s3 ls s3://dst_s3_bucket --recursive (lists all items, which may be a lot!)
  • aws s3 ls s3://dst_s3_bucket --recursive | wc -l (returns the number of files)
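
A plain line count also counts any PRE lines that aws s3 ls prints for prefixes (directories). The sketch below counts only object lines and adds up their sizes. It assumes the four-column listing format (date, time, size in bytes, key) that aws s3 ls --recursive produces; summarize_listing is a hypothetical name.

```shell
# Hypothetical helper: read an `aws s3 ls` listing on stdin and report the
# number of objects and their combined size. Lines whose third column is
# not a number (such as "PRE images/") are skipped.
summarize_listing() {
    awk '$3 ~ /^[0-9]+$/ { files++; bytes += $3 }
         END { printf "%d files, %.0f bytes\n", files, bytes }'
}

# Usage:
#   aws s3 ls s3://dst_s3_bucket --recursive | summarize_listing
```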

Mounting the Data

If you have an account with Amazon’s EC2 (Elastic Compute Cloud) service, you can launch an EC2 instance and mount our S3 bucket that contains the Multimedia Commons. Then you can work in your EC2 instance without having to download the data at all (and thus without incurring the associated storage costs). The instructions below work whether you are mounting the s3://multimedia-commons bucket or your own bucket, in case you made a copy of our data. These instructions assume you already have an EC2 instance running. See the next section for instructions on using the Multimedia Commons CloudFormation Template to launch an EC2 instance with some of our tools pre-installed, or see the AWS EC2 documentation for more general instructions.

Step 1: If you just launched a new EC2 instance, update the system first.

  • For Amazon Linux, CentOS or Red Hat:
    sudo yum update
  • For Ubuntu or Debian:
    sudo apt-get update

Step 2: Install needed dependencies.

  • In Amazon Linux, CentOS or Red Hat:
    sudo yum install gcc libstdc++-devel gcc-c++ fuse fuse-devel curl-devel libxml2-devel openssl-devel mailcap
  • In Ubuntu or Debian:
    sudo apt-get install build-essential gcc libfuse-dev fuse-utils libcurl4-openssl-dev libxml2-dev mime-support

Step 3: Download the latest s3fs package to a local directory on your EC2 instance, untar it by executing tar -xvzf s3fs-1.74.tar.gz, enter the extracted directory by typing cd s3fs-1.74 (the file names here assume version 1.74), and finally compile and install it:

  • ./configure --prefix=/usr
  • make
  • sudo make install

Step 4: Create an access key and secret key from the AWS console, if you haven’t done so yet. The Security Credentials page shows your access key and your secret key (click Show to make the secret key visible).

Step 5: Save the access key and secret key to your EC2 instance.

  • Create a new file in your /etc directory with the name passwd-s3fs, e.g. using the text editor vim.
  • Copy the access key and then the secret key from the Security Credentials page and paste them into the file with a colon in between them (no space): accesskey:secretkey. Do not hit enter on your keyboard after adding the keys.
  • Save the file and exit.
  • Update the permissions on your password file: chmod 640 /etc/passwd-s3fs.
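
The file format described above (accesskey:secretkey, no trailing newline, mode 640) can also be produced from the shell. This is a sketch rather than an official procedure: write_s3fs_credentials is a hypothetical helper, and the keys in the usage line are placeholders.

```shell
# Hypothetical helper: write "ACCESSKEY:SECRETKEY" to a file.
write_s3fs_credentials() {
    file=$1
    access_key=$2
    secret_key=$3
    # printf without "\n" writes no trailing newline; the restrictive umask
    # keeps the file private while it is being created.
    ( umask 077 && printf '%s:%s' "$access_key" "$secret_key" > "$file" ) &&
    chmod 640 "$file"
}

# Usage (moving the file into /etc needs root):
#   write_s3fs_credentials ./passwd-s3fs ACCESSKEY SECRETKEY
#   sudo mv ./passwd-s3fs /etc/passwd-s3fs
```

Keep in mind that keys typed on the command line may end up in your shell history.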

Step 7: Create a directory in which to mount the S3 bucket, for instance in your home directory.

  • mkdir ~/my_s3_bucket

Step 8: Mount the S3 bucket and make its contents accessible within the directory you just created.

  • s3fs BUCKETNAME ~/my_s3_bucket

where BUCKETNAME is the name of the bucket you want to mount, e.g. multimedia-commons for our bucket. If you get an error that the s3fs utility cannot be found, call it by its complete path. You can find out where the s3fs utility is stored on your file system by executing the command which s3fs.
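
Steps 7 and 8 can be combined into one small function. The sketch below assumes the credentials file from Step 5 is at /etc/passwd-s3fs and passes it explicitly with the passwd_file option. It also mounts the bucket read-only with -o ro, which is a choice made here for safety and not something the steps above require. mount_bucket is a hypothetical name.

```shell
# Hypothetical helper: create the mount point if needed, then mount the
# bucket read-only using the credentials file from Step 5.
mount_bucket() {
    bucket=$1        # e.g. multimedia-commons
    mount_point=$2   # e.g. ~/my_s3_bucket
    mkdir -p "$mount_point" &&
    s3fs "$bucket" "$mount_point" -o passwd_file=/etc/passwd-s3fs -o ro
}

# Usage:
#   mount_bucket multimedia-commons ~/my_s3_bucket
# To unmount again:
#   fusermount -u ~/my_s3_bucket
```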

Step 9: Check that the S3 bucket was successfully mounted.

  • df -Th ~/my_s3_bucket  

Note that s3fs always reports 256TB as the size of the disk. If you are mounting a bucket you created yourself and the data it contains is not visible, adjust the permissions in the bucket’s access control list (ACL) so that the data can be read. You can do this in the AWS Management Console; see here for more information.
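
Because the reported size is always the same, a scripted check can look at the filesystem type instead. The sketch below assumes that an s3fs mount shows up with type fuse.s3fs in the second column of df -PT output; that is the usual naming for FUSE filesystems, but verify it on your system. Both function names are hypothetical.

```shell
# Hypothetical helpers: extract the filesystem type from `df -PT` output
# (second column of the first data row) and test whether it is s3fs.
fs_type_from_df() {
    awk 'NR == 2 { print $2 }'
}

is_s3fs_mount() {
    [ "$(df -PT "$1" | fs_type_from_df)" = "fuse.s3fs" ]
}

# Usage:
#   is_s3fs_mount ~/my_s3_bucket && echo "bucket is mounted"
```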


Using our CloudFormation Template to Launch EC2

AWS CloudFormation Templates provide a simplified way of launching an EC2 instance with an environment already set up for a particular task — in this case, working with the Multimedia Commons dataset. Here, we explain how to launch an EC2 instance with the Multimedia Commons template we prepared. Our template also pre-installs the audioCaffe analysis tool for you.

Step 1: Download the Multimedia Commons CloudFormation Template.

Step 2: In your AWS Management Console, click CloudFormation.

Step 3: Click Create New Stack.

Step 4: Choose Upload a template to Amazon S3 and upload the Multimedia Commons template you downloaded in Step 1.

Step 5: Enter the name of the stack in the “Stack Name” field.

Step 6: In the “Parameters” section, enter the name of the key pair for the EC2 instance in the “KeyName” field. You can create a new key pair or use one you’ve already created.

Step 7: Choose an EC2 instance type using the “audioCaffeInstanceType” field, according to the level of computing power you need. A list of instance types can be found on the Amazon EC2 Instances page. Click Next.

Step 8: On the Options page, you can specify key-value pairs for describing the stack. Click Next.

Step 9: Review the settings and click Create. Wait for the status of your stack to change to CREATE_COMPLETE.
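
The console steps above have a rough command-line equivalent in the AWS CLI. The sketch below is an assumption-laden outline, not the documented procedure: launch_mmc_stack and the template file name are hypothetical, and only the parameter names KeyName and audioCaffeInstanceType come from the steps above. If the template creates IAM resources, CloudFormation additionally requires a --capabilities option; whether that applies here is not covered in this guide.

```shell
# Hypothetical helper: create the stack from a local template file and
# block until CloudFormation reports CREATE_COMPLETE.
launch_mmc_stack() {
    stack_name=$1
    template_file=$2   # path to the template downloaded in Step 1
    key_name=$3        # existing EC2 key pair name
    instance_type=$4   # EC2 instance type of your choice
    aws cloudformation create-stack \
        --stack-name "$stack_name" \
        --template-body "file://$template_file" \
        --parameters \
            "ParameterKey=KeyName,ParameterValue=$key_name" \
            "ParameterKey=audioCaffeInstanceType,ParameterValue=$instance_type" &&
    aws cloudformation wait stack-create-complete --stack-name "$stack_name"
}

# Usage:
#   launch_mmc_stack my-mmc-stack ./mmc.template my-key-pair m4.large
```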

Step 10: Navigate to the EC2 console. You will see that two EC2 instances, “audioCaffe Server” and “NAT (audioCaffe VPC)”, have been created as parts of the new stack. When its State is “running”, you can connect to the “audioCaffe Server” instance using the key pair you supplied in Step 6.

ssh -i KEYPAIRNAME.pem ubuntu@IPADDRESS

(Where KEYPAIRNAME should be replaced with the key pair name you entered in Step 6 and IPADDRESS should be replaced with the IP address of the audioCaffe server.)
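
ssh refuses to use a private key file that other users can read, so it is common to restrict the key’s permissions before connecting. A minimal sketch, with connect_audiocaffe as a hypothetical helper name:

```shell
# Hypothetical helper: make the key file readable by its owner only,
# then open an SSH session as the "ubuntu" user.
connect_audiocaffe() {
    key_file=$1     # e.g. KEYPAIRNAME.pem
    ip_address=$2   # IP address of the audioCaffe server
    chmod 400 "$key_file" &&
    ssh -i "$key_file" "ubuntu@$ip_address"
}

# Usage:
#   connect_audiocaffe KEYPAIRNAME.pem IPADDRESS
```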

Once connected, you will find that audioCaffe is already installed under the home directory. See the audioCaffe description for more details.

See the AWS CloudFormation documentation for additional information.


Attaching an EBS Volume to an EC2 Instance

Step 1: In the EC2 console, click Volumes under “Elastic Block Store” (in the left-hand menu).

Step 2: Click Create Volume. In the pop-up window, change the “Size” of the volume (in GiB). Change the values of other fields if needed. Click Create.

Step 3: Click on the volume you created, then in the Actions drop-down menu, click Attach Volume.

Step 4: In the “Instance” field, you can choose from a list of EC2 instances. For example, you can click audioCaffe Server if you want to attach the volume to your EC2 instance created using the Multimedia Commons CloudFormation Template.

Step 5: Log into the EC2 instance, then mount the EBS volume according to these instructions.
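
As a rough outline of what Step 5 typically involves on a Linux instance, the sketch below formats the volume only if it does not contain a filesystem yet, then mounts it. Everything specific here is an assumption: the device name (/dev/xvdf is a common choice but may differ on your instance; check with lsblk), the ext4 filesystem, the mount point, and the helper name mount_new_volume. Formatting erases any data on the device, so double-check the device name first.

```shell
# Hypothetical helper: prepare and mount a newly attached EBS volume.
mount_new_volume() {
    device=$1        # e.g. /dev/xvdf (assumption, verify with lsblk)
    mount_point=$2   # e.g. /data
    # `file -s` prints "<device>: data" when it finds no filesystem.
    if sudo file -s "$device" | grep -q ': data$'; then
        sudo mkfs -t ext4 "$device"
    fi
    sudo mkdir -p "$mount_point" &&
    sudo mount "$device" "$mount_point"
}

# Usage:
#   mount_new_volume /dev/xvdf /data
```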


Setting Up Solr to Search the Data

Because the dataset is so large, it is easiest to browse its contents using a relational database such as MySQL or PostgreSQL for basic SQL-style querying, or a search platform such as Apache Solr if you want more powerful search capabilities. A public alternative is the Multimedia Commons Search, although it is still under development.

You can either download a precompiled search index or set up your own search. First, we describe how to set up Solr with a downloaded search index.

Step 1: Download Solr from the Solr Resources page. (This page also has extensive information and tutorials about how to use Solr in general.)

Step 2: Use rclone to download the precompiled search index. A normal download is not possible because one of the files is larger than 35 GB.

Step 3: Move the index to PATH/server/solr. (Replace PATH with the path of the directory where you downloaded Solr.)

Step 4: Start the server.

  • PATH > ./bin/solr start

Because the search index is large, memory problems can occur: by default, the JVM starts with a small amount of memory. In that case, increasing the memory may help (PATH > ./bin/solr restart -m 12g). Run PATH > ./bin/solr restart --help to see further start options for optimization. While Solr is running, you can check memory consumption and other properties in the Solr dashboard at http://localhost:8983/solr/#/.

Step 5: Open your web browser and connect to the web console:

  • localhost:8983/solr.

Step 6: Happy Searching!

Next, we explain how to create your own search database with Apache Solr.

Step 1: Download Solr from the Solr Resources page. (This page also has extensive information and tutorials about how to use Solr in general.)

Step 2: Under PATH/server/solr, create a directory for the new core. (Replace PATH with the path of the directory where you downloaded Solr.) For example, you might create a new directory called yfcc100m.

Step 3: Download the MMC search sample schema and config file and put them in the new core directory.

Step 4: Start the server.

  • PATH > ./bin/solr start

Because the search index is large, memory problems can occur: by default, the JVM starts with a small amount of memory. In that case, increasing the memory may help (PATH > ./bin/solr restart -m 12g). Run PATH > ./bin/solr restart --help to see further start options for optimization. While Solr is running, you can check memory consumption and other properties in the Solr dashboard at http://localhost:8983/solr/#/.

Step 5: Open your web browser and connect to the web console:

  • localhost:8983/solr.

Step 6: If the core is not yet present, click Add Core. If you created the core directory under PATH/server/solr as described in Steps 2 and 3, the core should already be available. You can name the core whatever you want, but you must set the “instanceDir” field to the directory name you used in Step 2. (The core does not need to have the same name as the directory, but it can.) Leave the other three fields unchanged. You can also use Add Core to register a core stored in another location. Note that, depending on user rights, access to such a core might cause conflicts or security risks.
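
The Add Core dialog corresponds to Solr’s CoreAdmin API, so the same step can be scripted. The sketch below assumes Solr is listening on localhost:8983 as in the steps above; create_core is a hypothetical helper name.

```shell
# Hypothetical helper: register a core through the CoreAdmin API.
# `curl -G` sends the URL-encoded values as query parameters.
create_core() {
    core_name=$1      # any name you like
    instance_dir=$2   # directory created in Step 2, e.g. yfcc100m
    curl -G 'http://localhost:8983/solr/admin/cores' \
        --data-urlencode 'action=CREATE' \
        --data-urlencode "name=$core_name" \
        --data-urlencode "instanceDir=$instance_dir"
}

# Usage:
#   create_core yfcc100m yfcc100m
```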

Step 7: Populate the empty core with the data you want to work with. Solr allows you to directly upload a CSV (or any properly formatted file) to the core.

  1. Some files, such as the metadata file for the YFCC100M dataset, need to be reformatted before they can be uploaded. To reformat the YFCC100M metadata file, you can use this custom Python script. It is a good idea to do a test run with a small subset of the metadata file first (e.g., the first 10,000 lines). Alternatively, you can download a reformatted table to which some metadata has already been added (17 GB; rclone is probably required).
  2. Update the core with the properly formatted file. For the YFCC100M metadata, you would use this command:
    curl 'http://localhost:8983/solr/[CORE NAME]/update/csv?commit=true&separator=%09&header=true&stream.file=[PATH TO REFORMATTED METADATA FILE]&f.usertags.split=true&f.usertags.separator=%2C&f.machinetags.split=true&f.machinetags.separator=%2C&overwrite=true'

    This may take an hour or longer, depending on the power of your CPU. If you want to learn more about what each parameter does, see the Solr Wiki’s instructions for Updating a Solr Index with CSV. Unfortunately, uploading a CSV only works for adding new documents; it cannot modify existing ones, for example to add data from extensions such as autotags.
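
The small test run suggested above can be prepared with head. This sketch assumes the reformatted file begins with a header line (the update command passes header=true), so it keeps one line more than the requested number of rows. make_sample and the file names are hypothetical.

```shell
# Hypothetical helper: copy the header line plus the first N data rows
# (default 10000) of a reformatted metadata file into a smaller file.
make_sample() {
    src=$1
    dst=$2
    rows=${3:-10000}
    head -n "$((rows + 1))" "$src" > "$dst"
}

# Usage:
#   make_sample yfcc100m_reformatted.tsv yfcc100m_sample.tsv
```

Pointing stream.file at the sample file first makes it quicker to spot formatting problems before the full upload.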

Step 8: Select the new core from the Core Selector dropdown menu.

Step 9: Select “Query”. (To retrieve the first ten documents, you can just click Execute Query.)

The results of your query will be shown in the right-hand panel.

(Note that you can also use the URL bar to formulate a query. The URL syntax you could have used will appear above your query results.)
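
The same kind of query URL can be sent from the command line. The sketch below assumes Solr is running on localhost:8983 and uses Solr’s standard select handler with the q and rows parameters. solr_query is a hypothetical helper, and the usertags field in the usage line is taken from the update command in Step 7.

```shell
# Hypothetical helper: run a query against a core and print the response.
# `curl -G --data-urlencode` takes care of escaping the query string.
solr_query() {
    core=$1
    query=$2         # e.g. '*:*' to match every document
    rows=${3:-10}    # number of documents to return
    curl -G "http://localhost:8983/solr/$core/select" \
        --data-urlencode "q=$query" \
        --data-urlencode "rows=$rows"
}

# Usage:
#   solr_query yfcc100m '*:*'
#   solr_query yfcc100m 'usertags:sunset' 5
```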

Step 10: Add the metadata from extensions like places or autotags.
