The Multimedia Commons data and related resources are stored on Amazon S3 (Simple Storage Service), in the multimedia-commons data bucket. This page explains how you can use Amazon Web Services and other tools to access the data, download it, or work with it directly in the cloud.
Browsing the Data
Downloading the Data
Copying the Data
Mounting the Data
Using our CloudFormation Template
Attaching an EBS Volume to an EC2 Instance
Setting Up Solr to Search the Data
You can explore the Multimedia Commons data through our data portal.
You can download the whole dataset or a portion of it to your local hard drive or to an AWS EBS volume attached to an AWS EC2 instance. If you use an EBS volume, provision one large enough to hold the data you plan to copy.
If you just want to download individual files, you can also use our data portal to navigate to individual items and then save them directly to your computer. However, if you want to download larger batches of files, you probably will want to use specialized utilities. There are many tools out there that can be used for this, so try them out and use whichever you feel comfortable with. For example:
- s3cmd is useful for simple operations such as listing contents, downloading small numbers of files, etc. An example s3cmd command:
s3cmd ls s3://multimedia-commons/images
- s3gof3r provides parallelized streaming access to Amazon S3, which is especially useful for downloading large files. An example s3gof3r command:
gof3r get -b multimedia-commons -k images/ -p ./
If you’re already using S3, you can also copy the data to your own S3 bucket.
Step 1: To copy directly from a source bucket (we’ll call it src_s3_bucket) to a destination bucket (we’ll call it dst_s3_bucket), use the cp command of Amazon’s own aws command line utility:
aws s3 cp s3://src_s3_bucket/ s3://dst_s3_bucket/ --recursive
The --recursive flag specifies that all files are copied, with the same directory structure as the original.
For example, to copy every MFCC20 audio feature file to your bucket, using the same directory structure, use something like this:
aws s3 cp s3://multimedia-commons/feats/audio/mfcc20/ s3://dst_s3_bucket/feats/audio/mfcc20 --recursive
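When several parts of the dataset are needed, it can be convenient to generate one copy command per prefix and review them before anything is transferred. The following is a minimal sketch, not an official tool: plan_copy is a hypothetical helper name, dst_s3_bucket is the placeholder destination used above, and the commands are only printed, with the AWS CLI's --dryrun flag appended so that running them would preview the copy rather than perform it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one dry-run copy command per prefix.
# Usage: plan_copy DST_BUCKET PREFIX [PREFIX...]
plan_copy() {
  local dst_bucket="$1"; shift
  local prefix
  for prefix in "$@"; do
    prefix="${prefix%/}"   # normalize: drop one trailing slash if present
    printf 'aws s3 cp s3://multimedia-commons/%s/ s3://%s/%s/ --recursive --dryrun\n' \
      "$prefix" "$dst_bucket" "$prefix"
  done
}

# Example: print the commands for two prefixes mentioned on this page.
# plan_copy dst_s3_bucket feats/audio/mfcc20 images
```

Once the printed commands look right, remove --dryrun and run them to start the actual copy.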
Step 2: You can check the contents of the destination bucket using one of the following commands:
aws s3 ls s3://dst_s3_bucket --recursive (lists all items, which may be a lot!)
aws s3 ls s3://dst_s3_bucket --recursive | wc -l (returns the number of files)
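To verify a copy or to estimate how large an EBS volume needs to be, it can help to total both the number of objects and their size from a recursive listing. The sketch below assumes the usual four-column output of aws s3 ls with --recursive (date, time, size in bytes, key); summarize_listing is a hypothetical helper name, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a recursive S3 listing read from stdin.
# Assumes each object line looks like:
#   2015-06-01 12:00:00    1048576 path/to/key
summarize_listing() {
  awk '
    NF >= 4 { files += 1; bytes += $3 }   # skips blank lines and "PRE" prefix lines
    END     { printf "%d files, %.0f bytes\n", files + 0, bytes + 0 }
  '
}

# Example (needs the AWS CLI and network access, so it is left commented out):
# aws s3 ls s3://dst_s3_bucket --recursive | summarize_listing
```

The AWS CLI can also report totals itself when --summarize is added to a recursive listing; the helper is mainly useful for listings you have saved to a file.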
If you have an account with Amazon’s EC2 (Elastic Compute Cloud) service, you can launch an EC2 instance and mount our S3 bucket that contains the Multimedia Commons. Then you can work in your EC2 instance without having to download the data at all (and thus without incurring the associated storage costs). The instructions below work whether you are mounting the s3://multimedia-commons bucket or your own bucket, in case you made a copy of our data. These instructions assume you already have an EC2 instance running. See the next section for instructions on using the Multimedia Commons CloudFormation Template to launch an EC2 instance with some of our tools pre-installed, or see the AWS EC2 documentation for more general instructions.
Step 1: If you just launched a new EC2 instance, update the system first.
- For Amazon Linux, CentOS or Red Hat:
sudo yum update
- For Ubuntu or Debian:
sudo apt-get update && sudo apt-get upgrade
Step 2: Install needed dependencies.
- In Amazon Linux, CentOS or Red Hat:
sudo yum install gcc libstdc++-devel gcc-c++ fuse fuse-devel curl-devel libxml2-devel openssl-devel mailcap
- In Ubuntu or Debian:
sudo apt-get install build-essential gcc libfuse-dev fuse-utils libcurl4-openssl-dev libxml2-dev mime-support
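Step 3: Download the latest s3fs package to a local directory on your EC2 instance, untar it by executing tar -xvzf s3fs-1.74.tar.gz, enter the extracted s3fs directory by typing cd s3fs-1.74 (adjust the version number to match the file you downloaded), and finally compile and install it. If the directory does not contain a Makefile yet, run ./configure first.
sudo make install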
Step 4: Create an access key and secret key from the AWS console, if you haven’t done so yet. The Security Credentials page shows your access key and your secret key (click Show to make the secret key visible).
Step 5: Save the access key and secret key to your EC2 instance.
- Create a new file in your /etc directory with the name passwd-s3fs, e.g. using the text editor vim.
- Copy the access key and then the secret key from the Security Credentials page and paste them into the file with a colon in between them (no space): accesskey:secretkey. Do not press Enter after adding the keys.
- Save the file and exit.
- Update the permissions on your password file: chmod 640 /etc/passwd-s3fs
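If you prefer not to edit the file by hand, the same result can be scripted. This is a minimal sketch under stated assumptions: write_s3fs_passwd is a hypothetical helper name, and the keys in the example are placeholders that you would replace with your own credentials. It writes accesskey:secretkey without a trailing newline and then applies the 640 permissions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an s3fs password file as ACCESSKEY:SECRETKEY
# with no trailing newline, then restrict its permissions.
# Usage: write_s3fs_passwd FILE ACCESS_KEY SECRET_KEY
write_s3fs_passwd() {
  local file="$1" access_key="$2" secret_key="$3"
  # Create (or empty) the file under a tight umask before any secret is written.
  ( umask 027; : > "$file" )
  # printf without "\n" avoids adding a newline after the keys.
  printf '%s:%s' "$access_key" "$secret_key" > "$file"
  chmod 640 "$file"
}

# Example (run as root; the keys shown are placeholders, not real credentials):
# write_s3fs_passwd /etc/passwd-s3fs YOUR_ACCESS_KEY YOUR_SECRET_KEY
```

Passing keys as command-line arguments can leave them in your shell history, so you may prefer to read them from environment variables or a prompt instead.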
Step 7: Create a directory in which to mount the S3 bucket, for instance in your home directory:
mkdir ~/my_s3_bucket
Step 8: Mount the S3 bucket and make its contents accessible within the directory you just created.
s3fs BUCKETNAME ~/my_s3_bucket
where you should replace BUCKETNAME with the name of the bucket you want to mount, e.g. multimedia-commons for our bucket. If you get an error that the s3fs utility cannot be found, add the complete path to where it is stored. You can find out where on your file system the s3fs utility is stored with a lookup command such as whereis s3fs.
Step 9: Check that the S3 bucket was successfully mounted.
df -Th ~/my_s3_bucket
s3fs will always report 256TB as the size of the disk. If you are trying to access a bucket you created yourself that contains data and the data is not visible, you need to adjust the permissions in the bucket’s access control list (ACL) so that it can be read. You can do this using the AWS Management Console; see the AWS documentation for more information.
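The mount check can also be scripted, for example before starting a long-running job that reads from the bucket. The sketch below assumes that df reports the filesystem type of an s3fs mount as fuse.s3fs, which is typical on Linux but worth confirming on your own system; fs_type_from_df and is_s3fs_mount are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -PT DIR` output on stdin and print the
# filesystem type (second column) of the last data line.
fs_type_from_df() {
  awk 'NR > 1 { type = $2 } END { print type }'
}

# Hypothetical helper: succeed only if DIR sits on a filesystem whose type
# is reported as fuse.s3fs (an assumption about how s3fs mounts appear).
is_s3fs_mount() {
  [ "$(df -PT "$1" 2>/dev/null | fs_type_from_df)" = "fuse.s3fs" ]
}

# Example:
# is_s3fs_mount ~/my_s3_bucket && echo "bucket is mounted" || echo "not mounted"
```

The -P option asks df for single-line POSIX output, which keeps the columns in a predictable position even when the filesystem name is long.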
AWS CloudFormation Templates provide a simplified way of launching an EC2 instance with an environment already set up for a particular task — in this case, working with the Multimedia Commons dataset. Here, we explain how to launch an EC2 instance with the Multimedia Commons template we prepared. Our template also pre-installs the audioCaffe analysis tool for you.
Step 1: Download the Multimedia Commons CloudFormation Template.
Step 2: In your AWS Management Console, click CloudFormation.
Step 3: Click Create New Stack.
Step 4: Choose Upload a template to Amazon S3 and upload the Multimedia Commons template you downloaded in Step 1.
Step 5: Enter the name of the stack in the “Stack Name” field.
Step 6: In the “Parameters” section, enter the name of the key pair for the EC2 instance in the “KeyName” field. You can create a new key pair or use one you’ve already created.
Step 7: Choose an EC2 instance type using the “audioCaffeInstanceType” field, according to the level of computing power you need. A list of instance types can be found on the Amazon EC2 Instances page. Click Next.
Step 8: On the Options page, you can specify key-value pairs for describing the stack. Click Next.
Step 9: Review the settings and click Create. Wait for the status of your stack to change to CREATE_COMPLETE.
Step 10: Navigate to the EC2 console. You will see that two EC2 instances, “audioCaffe Server” and “NAT (audioCaffe VPC)”, have been created as parts of the new stack. When its State is “running”, you can connect to the “audioCaffe Server” instance using the key pair you supplied in Step 6.
ssh -i KEYPAIRNAME.pem ubuntu@IPADDRESS
(KEYPAIRNAME should be replaced with the key pair name you entered in Step 6, and IPADDRESS should be replaced with the IP address of the audioCaffe server.)
Once connected, you will find that audioCaffe is already installed under the home directory. See the audioCaffe description for more details.
Step 1: In the EC2 console, click Volumes under “Elastic Block Store” (in the left-hand menu).
Step 2: Click Create Volume. In the pop-up window, change the “Size” of the volume (in GiB). Change the values of other fields if needed. Click Create.
Step 3: Click on the volume you created, then in the Actions drop-down menu, click Attach Volume.
Step 4: In the “Instance” field, you can choose from a list of EC2 instances. For example, you can click audioCaffe Server if you want to attach the volume to your EC2 instance created using the Multimedia Commons CloudFormation Template.
Step 5: Log into the EC2 instance, then mount the EBS volume according to these instructions.
Due to its sheer size, it is easiest to browse the contents of the dataset using a relational database or search platform, such as MySQL or PostgreSQL for basic SQL-style querying, or Apache Solr if you want more powerful search capabilities.
Here, we explain how to load the dataset into Apache Solr.
Step 1: Download Solr from the Solr Resources page. (This page also has extensive information and tutorials about how to use Solr in general.)
Step 2: Under PATH/server/solr, create a directory for the new core. (PATH should be replaced with the path of the directory where you downloaded Solr.) For example, you might create a new directory called
Step 3: Download the YFCC100M sample schema and config file and put it in the new core directory, then extract the archived files. (For best results, use tar -xzf to decompress and unpack the archive.)
Step 4: Start the server.
PATH > ./bin/solr start
Because the search index is large, memory problems can occur if the JVM is started with very little memory. In this case, increasing the memory may help (PATH > ./bin/solr restart -m 12g). PATH > ./bin/solr restart --help lists further start options for tuning. You can check memory consumption and other properties at run time in the Solr dashboard at http://localhost:8983/solr/#/.
Step 5: Open your web browser and connect to the web console at http://localhost:8983/solr/.
Step 6: Click on Add Core.
Step 7: You can name this core whatever you want, but you must set the “instanceDir” field to the directory name you used in Step 2. (The core does not need to have the same name as the directory, but it can.) Leave the other three fields unchanged. If you created the directory as suggested in Step 2, it should already be available. Otherwise, you can use the Add Core functionality to store the core in another location. Note that, depending on user rights, access to the core might cause conflicts or security risks.
Step 8: Populate the empty core with the data you want to work with. Solr allows you to directly upload a CSV (or any properly formatted file) to the core.
- Some files (for example, the metadata file for the YFCC100M dataset) need to be reformatted before they can be uploaded. To reformat the YFCC100M metadata file, you can use this custom Python script. It is a good idea to do a test run first with a small subset of the metadata file (e.g., the first 10,000 lines).
- Update the core with the properly formatted file. For the YFCC100M metadata, you would use this command:
curl 'http://localhost:8983/solr/[CORE NAME]/update/csv?commit=true&separator=%09&header=true&stream.file=[PATH TO REFORMATTED METADATA FILE]&f.usertags.split=true&f.usertags.separator=%2C&f.machinetags.split=true&f.machinetags.separator=%2C&overwrite=true'
This may take an hour or longer, depending on the power of your CPU. If you want to learn more about what each of the parameters does, you can check out the Solr Wiki’s instructions for Updating a Solr Index with CSV. Unfortunately, uploading a CSV works only for adding new documents, not for modifying existing ones, for example to add data from extensions such as autotags.
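The test run and the update request from Step 8 can be wrapped in two small shell helpers. This is a sketch under stated assumptions, not part of Solr or of the Multimedia Commons tools: make_subset and solr_csv_update_url are hypothetical names, the core name and file paths in the example are placeholders, and the machinetags separator is written as %2C (a comma), which is assumed to be the intended value. The file path is inserted as-is, so a path containing spaces or other special characters would need URL encoding first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the first N lines (default 10000) of a file,
# e.g. to try the upload on a small part of the reformatted metadata.
# Usage: make_subset SRC DST [LINES]
make_subset() {
  local src="$1" dst="$2" lines="${3:-10000}"
  head -n "$lines" "$src" > "$dst"
}

# Hypothetical helper: assemble the CSV update URL for a core and a file path.
# Usage: solr_csv_update_url CORE_NAME PATH_TO_FILE
solr_csv_update_url() {
  local core="$1" file="$2"
  printf '%s\n' "http://localhost:8983/solr/${core}/update/csv?commit=true&separator=%09&header=true&stream.file=${file}&f.usertags.split=true&f.usertags.separator=%2C&f.machinetags.split=true&f.machinetags.separator=%2C&overwrite=true"
}

# Example (placeholder core name and paths; needs a running Solr server):
# make_subset /data/metadata_reformatted.tsv /data/metadata_sample.tsv 10000
# curl "$(solr_csv_update_url my_core /data/metadata_sample.tsv)"
```

Building the URL in one place makes it easier to switch between the small test file and the full metadata file without retyping the parameters.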
Step 9: Select the new core from the Core Selector dropdown menu.
Step 10: Select “Query”. (To retrieve the first ten documents, you can just click Execute Query.)
The results of your query will be shown in the right-hand panel.
(Note that you can also use the URL bar to formulate a query. The URL syntax you could have used will appear above your query results.)
Step 11: Add the metadata from extensions like places or autotags.