AudioNet: Audio Annotation of Consumer-Produced Videos

Audio and multimedia researchers at the International Computer Science Institute are beginning a project to provide human-generated audio annotations for the ~800,000 videos in the YFCC100M dataset. This will be an open-source resource for the multimedia community, as part of the Multimedia Commons.

To start us off, we need input from potential users of the dataset! What are you working on? How could a massive audio-annotated video corpus help you with that?

Input Channel #1: Take the AudioNet survey!

We want your opinions! Take the survey here.

Input Channel #2: Shoot us an email!

Contact us via jbernd[chez]icsi[stop]berkeley[stop]edu.

More About the Project

AudioNet will be a corpus of audio annotations for the ~800,000 YFCC100M videos. It will likely focus on audio concepts, with annotations relevant to particular tasks such as scene classification, event detection, and sentiment analysis. Audio concepts like crowd cheering or fire alarm have been used in video analysis to detect situations that may be difficult to identify by visual means alone.

Audio analysis thus provides an important complement to analysis of the visual stream in videos, and recent research on integrating the two sources to increase classification accuracy is promising. However, efforts to realize this potential have been hampered by the fact that audio researchers lack ground-truth corpora at anything near the scale and variety available in the image domain.

AudioNet will be structured using existing semantic classification resources like WordNet and/or FrameNet, so it can easily be integrated with other projects that use those resources.
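To make that linkage concrete, here is a minimal sketch of what a WordNet-linked annotation record might look like. This is purely illustrative: the field names, the video ID, and the synset identifier are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch of an AudioNet-style annotation record (NOT the
# project's actual schema): each audio-concept label is paired with a
# WordNet synset identifier so annotations can be cross-referenced with
# other WordNet-based resources.
from dataclasses import dataclass, asdict
import json

@dataclass
class AudioAnnotation:
    video_id: str        # YFCC100M video identifier (placeholder value)
    start_sec: float     # segment start time within the video
    end_sec: float       # segment end time within the video
    concept: str         # human-readable audio-concept label
    wordnet_synset: str  # illustrative WordNet-style synset ID

ann = AudioAnnotation(
    video_id="yfcc100m_0000001",
    start_sec=12.5,
    end_sec=15.0,
    concept="fire alarm",
    wordnet_synset="fire_alarm.n.01",
)

# Serialize to JSON for interchange with other WordNet-based tools.
print(json.dumps(asdict(ann), indent=2))
```

Because the label is tied to a synset ID rather than a free-text string, downstream tools that already speak WordNet could merge or query AudioNet annotations without label-matching heuristics.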

The project originates at ICSI with Gerald Friedland, Julia Bernd, and Jaeyoung Choi. As we plan AudioNet, we aim to better understand the needs and priorities of audio and multimedia researchers with respect to annotated audio corpora: the common tasks, target domains, and preferred formats.