The AudioNet Pilot: Experimenting with Audio Annotation

The International Computer Science Institute received a planning grant from the National Science Foundation to determine what researchers analyzing environmental audio need from a human-annotated dataset, and to experiment with procedures for creating such a dataset using the ~800,000 videos in the YFCC100M. In particular, we wanted to explore how to balance the sometimes-competing goals of consistency, completeness, and complexity in labeling.

We found there is a need for large open-source datasets in which videos are annotated exhaustively for a broad set of sound events (a.k.a. audio concepts), with precise beginning and ending timepoints for each sound event. But timepoint annotation is very time-consuming! So we set out to see what the trade-offs are.

This report provides some details about what we learned, to help others with audio dataset planning.

Summary of Dataset Spec and Procedures

Videos: We annotated a subset of the videos in the YLI-MED corpus for video-level event detection. We chose videos randomly from three events:

  • Ev106, Person Grooming an Animal, and Ev107, Person Hand-Feeding an Animal, the two most-confused events in the YLI-MED across the greatest number of previous audio experiments by various researchers, and
  • Ev104, Parade, (one of) the events least-confused with either of the other two across the greatest number of previous audio experiments.

Labels: For the tasks where we used a pre-set list of labels, we drew them from the Audio Set ontology of sound events developed at Google Research.

Annotation Tool: Annotators used ELAN, a linguistic transcription tool that produces XML output files. It provides a visual waveform representation that annotators can click on to mark the beginning and end timepoints of a label application (cf. Seeing Sound).
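For downstream processing, ELAN's XML output (.eaf files) can be read with any XML parser; a minimal sketch using Python's standard library, assuming the time-aligned labels are stored as standard EAF ALIGNABLE_ANNOTATION elements (the function name is ours, not part of any ELAN API):

```python
import xml.etree.ElementTree as ET

def read_eaf(path):
    """Extract (label, start_ms, end_ms) triples from an ELAN .eaf file."""
    root = ET.parse(path).getroot()
    # The TIME_ORDER section maps symbolic time-slot IDs to millisecond offsets.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    events = []
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        events.append((ann.findtext("ANNOTATION_VALUE", default=""),
                       slots[ann.get("TIME_SLOT_REF1")],
                       slots[ann.get("TIME_SLOT_REF2")]))
    # Sort by onset so annotations from overlapping tiers interleave chronologically.
    return sorted(events, key=lambda e: e[1])
```

This flattens all tiers into one chronological event list, which is usually what you want for agreement calculations.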

Annotators: To allow for easy feedback (both ways) and flexibility, annotation was performed by two in-house annotators. For each video or snippet, annotators recorded the time it took them to annotate it.

Phases: The annotation was divided into three tasks, with different label sets and procedures. Each is described below, along with some findings and open questions.

Procedures for Each Task

Task 1 Procedures

  • Took a random 10s snippet from each video to be annotated.
  • 64 snippets were annotated, including 11 done by both annotators.
  • Total of ~246 label applications (~4 per snippet).
  • Freehand: Annotators were instructed to assign a descriptive label to each and every sound (“Try to be specific, but use a label you think it’s reasonably likely someone else might use as well”).
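Snippet extraction itself is straightforward, e.g. with ffmpeg; a sketch of one way to pick a uniformly random 10s window (the function and its argument handling are illustrative, not our actual extraction script):

```python
import random

def snippet_command(video_path, out_path, duration_s, snippet_s=10, seed=None):
    """Build an ffmpeg command that cuts a uniformly random snippet_s-second
    window out of a video. Returns the chosen start offset and the command."""
    rng = random.Random(seed)  # seed it so snippet choices are reproducible
    start = rng.uniform(0.0, max(0.0, duration_s - snippet_s))
    # -ss before -i seeks to the window start; -t limits the output length.
    # "-c copy" avoids re-encoding but may snap the cut to a keyframe;
    # drop it and re-encode if frame-accurate cuts matter.
    cmd = ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", video_path,
           "-t", str(snippet_s), "-c", "copy", out_path]
    return start, cmd
```

Logging the returned start offset alongside the snippet makes it possible to map snippet-level timepoints back to the full video later.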

Task 2 Procedures

  • Drew from the same set of 10s snippets used in Task 1; some were assigned to the same annotator and some crossed annotators.
  • 85 snippets were annotated, including 5 done by both annotators.
  • Total of 429 label applications (~5 per snippet).
  • Used a pre-specified label set consisting of most of Audio Set (662 labels) plus ‘Other’/fill-in-the-blank. Annotators were instructed to label every sound.

Task 3 Procedures

  • Annotated whole videos.
  • 36 videos were annotated, including 6 done by both annotators.
  • Total of 430 label applications (~12 per video).
  • Used a narrow pre-specified label set drawn from Audio Set (28 labels, some slightly modified for clarity of scope). The labels were chosen because they were applied to Task 2 snippets from more than one event (this rubric was chosen arbitrarily). Annotators were instructed to ignore sounds not in the label set.
  • As we were not able to provide clips with blank silence inserted between snippets or (better yet) randomized order of snippets within the video (cf. CHiME-Home), annotators were instructed to take frequent mental breaks, to try to avoid fatigue and perceptual memory effects.

How Long Did It Take?

Time Multipliers for Task 1 (10s snippets, freehand labels)

  • The average multiplier was 62x across all snippets.
  • Discarding the first three snippets coded by each annotator (as they were learning the procedures), the average multiplier was 54x.
  • Wide variation between annotators (42x vs. 72x).
  • Average 136s per label application.

Time Multipliers for Task 2 (10s snippets, full Audio Set labels)

  • The average multiplier was 62x across all snippets.
  • Discarding the first three snippets coded by each annotator, the average multiplier was 57x.
  • Wide variation between annotators (52x vs. 90x).
  • Average 101s per label application.

Time Multipliers for Task 3 (full videos, limited Audio Set labels)

  • The average multiplier was 21x across all videos.
  • Discarding the first three videos coded by each annotator, the average multiplier was 18x.
  • Relatively minor variation between annotators (18x vs. 21x).
  • Average 52s per label application.
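The multipliers above are simply annotation time divided by media duration, averaged per annotator; a minimal sketch of that bookkeeping (the record format is illustrative):

```python
def multiplier_stats(records, skip_first=0):
    """records: (annotator, media_seconds, annotation_seconds) tuples, in the
    order each annotator worked. Returns each annotator's average multiplier,
    optionally discarding their first few items (the learning period)."""
    by_annotator = {}
    for annotator, media_s, work_s in records:
        by_annotator.setdefault(annotator, []).append(work_s / media_s)
    return {a: sum(m[skip_first:]) / len(m[skip_first:])
            for a, m in by_annotator.items()}
```

Tracking the learning period separately (the skip_first parameter) matters: as the numbers above show, the first few items can noticeably inflate the average.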

Note: Time differences between our tasks may, of course, be due to practice as well as the nature of the tasks and the ways they were implemented. Time differences between our experiments and those performed by others may be due to implementation (crowdsourcing vs. in-house, choice of tool), to choices about labeling (adding complexity with timestamps, unconstrained or large constrained label sets), or to the nature of the data (consumer-produced, visual and audio streams).

Accuracy and Agreement

Caveat – No Stats: Unfortunately, within the time available for the project, we could not systematically align the files for videos/snippets labeled by both annotators and calculate agreements for their labels and timepoints. So our comments here are impressionistic.
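For anyone attempting the alignment we skipped, one simple starting point is greedy matching of same-label events whose onsets fall within a tolerance; a sketch (the 500ms default is an arbitrary illustration, not a validated choice):

```python
def match_events(a, b, tol_ms=500):
    """Greedily pair sound events from two annotators. Each event is a
    (label, start_ms, end_ms) tuple; two events match if they carry the
    same label and their onsets fall within tol_ms of each other."""
    unmatched_b = list(b)
    pairs, only_a = [], []
    for ev in sorted(a, key=lambda e: e[1]):
        candidates = [x for x in unmatched_b
                      if x[0] == ev[0] and abs(x[1] - ev[1]) <= tol_ms]
        if candidates:
            # Prefer the candidate whose onset is closest to this event's.
            best = min(candidates, key=lambda x: abs(x[1] - ev[1]))
            unmatched_b.remove(best)
            pairs.append((ev, best))
        else:
            only_a.append(ev)
    return pairs, only_a, unmatched_b
```

The matched pairs support timestamp-difference statistics; the two leftover lists quantify the one-caught-it-one-didn't disagreements discussed below.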

Accuracy of Labeling

  • Accuracy spot-checks against an expert listener found that annotators frequently missed miscellaneous clunks and camera noises (and sometimes more content-type sounds as well). Accuracy improved somewhat after we gave feedback, but omissions remained frequent throughout all tasks (very few annotations were truly exhaustive).
  • The slower annotator was slightly more accurate at the beginning, but the faster annotator improved more in response to feedback and became the more accurate one.
  • For Task 3 (small label set), annotators occasionally applied inapt labels where no appropriate label existed in the limited set (i.e. where they should not have labeled at all), but only rarely.
    • It is not clear whether this might have been more of a problem if the annotators had not previously dealt with the full Audio Set, nor whether they would have been able to make those decisions so quickly.
  • For Task 3, accuracy and agreement did not notably deteriorate over the course of the full video.

Agreement on Whether/When to Label

  • Inter-annotator agreement improved somewhat between Task 1 and Task 2, and slightly again between Task 2 and Task 3.
    • It is not clear to what degree improvements were due to feedback and practice vs. the different nature of the tasks.
  • Across all three tasks, the most frequent type of disagreement between annotators seems to have been one catching a sound that the other didn’t.
  • Another frequent source of disagreement was in chunking, i.e. whether to annotate something as one longer instance of a sound vs. several shorter instances. This was especially noticeable with human speech sounds. (Annotators had been instructed, “If it belongs to a unified action, it should be labeled as a single sound event,” with examples like “Perceivable pauses in conversation make for boundaries between sound events” [cf. CASA2009].)
  • Comparing within the same annotator for the same snippet/video across different tasks, each tended to catch and annotate more sounds in Task 2 than Task 1, and (where there were labels) more in Task 3 than Task 2. However, they not infrequently failed to catch sounds in later tasks that they had caught before.
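One way to normalize chunking disagreements after the fact is to merge same-label events separated by only a short gap; a post-processing sketch (the 1s gap threshold is an arbitrary choice for illustration, not one from our annotator instructions):

```python
def merge_chunks(events, max_gap_ms=1000):
    """Merge consecutive same-label (label, start_ms, end_ms) events
    separated by less than max_gap_ms, treating them as one longer event."""
    merged = []
    for label, start, end in sorted(events, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == label and start - merged[-1][2] < max_gap_ms:
            # Extend the previous event of the same label across the small gap.
            merged[-1] = (label, merged[-1][1], max(end, merged[-1][2]))
        else:
            merged.append((label, start, end))
    return merged
```

Applying the same merge rule to both annotators' files before comparison separates genuine detection disagreements from mere chunking differences.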

Agreement on Label Choice

  • Spot-checking Task 1 (freehand labeling), where annotators caught the same sounds and chunked them the same way:
    • The labels often did not align precisely, but were usually conceptually equivalent.
    • The annotators varied somewhat on granularity, but were more similar to each other than to Audio Set in terms of when they chose to get granular (see below).
  • Unsurprisingly, agreement between annotators was better on Task 2 (using Audio Set) than Task 1 (freehand).
    • However, disagreements still arose where Audio Set had multiple labels that might fit (e.g. ‘woman speaking’ vs. ‘narration’).
    • In some cases, this interacts with the problem (see above) of whether to annotate something as several instances of a sound or one long instance (e.g. ‘hubbub’ vs. multiple instances of ‘conversation’).
  • For both Task 1 and 2, annotators very frequently disagreed on whether to label music by dominant instrument, genre, or role, or if they labeled by genre, which one. (There were no musical labels in Task 3.)

Timestamp Agreement

  • A spot-check of timestamps shows the two annotators were usually within 150ms of each other (often under 50ms), though sometimes up to 500ms apart. (For all tasks.)
  • Comparing between tasks for the same annotator, timestamps tended to be reasonably close to their own previous annotation (the slower annotator was usually under 50ms difference, and almost all under 100ms; the faster annotator was more variable but usually less than ~200ms difference).
  • The inter- and intra-annotator timestamp differences were similar for the onsets and the offsets of the sounds.

Other Comments and Suggestions

Freehand Labeling vs. Constrained Label Sets

  • Comparing the freehand labels against Audio Set:
    • Labels sometimes matched Audio Set for simple/common sounds; more often, they didn’t match precisely, but were conceptually reasonably equivalent.
    • On the other hand, they often were more fine-grained than Audio Set, capturing features that are very distinctive with regard to sound sources, contacted objects, and activities (e.g. ‘sheep footsteps’ or ‘footsteps in grass’). It is, of course, an open question whether this nuance is actually useful.
  • However, in contrast to the granularity the annotators (sometimes) favored for Task 1, they did not tend to use the ‘Other’ label to get specific in Task 2. They were somewhat more likely to use a parent label if there was no accurate child (Audio Set is constructed to encourage this) or to use a slightly inapt label.
    • For example, for a specific animal sound that wasn’t in Audio Set, such as dogs chewing, they might use the higher-level ‘dog’ label, they might apply the human equivalent ‘chewing’ (under Human Sounds > Digestive), or perhaps they might use Other + a new label.
    • Given that animal sounds are important and distinctive for two of the events we chose, it would probably have been helpful to provide specific instructions about this domain.
  • The time multipliers were highest for Task 2 (full Audio Set). It is unclear how much this could have been improved by an interface that made better use of Audio Set’s hierarchical structure (rather than simply using underscores in the label list).

Relationship Between Freehand and Constrained Labeling

  • The Audio Set ontology was designed to be general, with the intention that it could be leafed out to include more granular labels for projects in specific domains. For example, for our purposes, i.e. distinguishing scenes involving feeding an animal from scenes involving grooming an animal, we would likely want to include a bigger set of animal sounds (Audio Set’s current coverage is mostly vocalizations).
  • To develop an actual dataset out of freehand labels, more structure and constraints would of course be required to encourage agreement (cf. the ESP game or MajorMiner). However, if one is beginning with a general ontology like Audio Set and one’s goal is simply to leaf out a specific domain, a quick round of freehand labeling (either with agreement-driving mechanics or a bit of ex machina post-processing) would probably be sufficient to generate a list of good labels to add.

Hindsight Advice: How We Could Have Made Better Use of Our In-House Annotators

  • If possible, recruit annotators with practice in very detailed audio annotation — for example, via a linguistics background.
  • If you need three annotators, hire four.
  • Give potential annotators a speed and accuracy test before hiring.
  • Perform accuracy checks early and often.
  • Provide gold-standard annotations for the annotators to check their own work against, rather than simply listing omissions.
  • Have annotators review each other’s annotations for snippets/videos that were assigned to both. Have them discuss potentially-addressable discrepancies, like when to count sounds as one long vs. multiple short sounds.

ELAN as a Tool for Environmental Audio Annotation on Videos

  • ELAN is more complex than necessary for the purpose; the combination of complex software and (somewhat) complex task required extra overhead time for annotators to learn.
  • ELAN is desktop software, so would require significant work to add a crowdsourcing front-end.
  • Has multi-track transcription, which can be used to separate overlapping sounds. However, it is not really intended to be used that way, and our kludge was not compatible with ELAN’s internal functions for aligning and comparing different annotators’ labels.
  • Can play the visual track from videos (requires an extracted audio track to produce the waveform). This differentiates it from SONYC’s AudioAnnotator, the main other tool we evaluated. (On the other hand, AudioAnnotator is inherently designed for crowdsourcing.)
  • No tool we identified had a good way to leverage Audio Set’s (or any other label set’s) hierarchical structure to make it easy to find the most appropriate label.
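A label picker that did exploit the hierarchy might index each label by its full ancestor path, so that searching for a parent term surfaces every descendant; an illustrative sketch (the toy ontology here is ours, not Audio Set’s actual structure):

```python
def find_labels(ontology, query):
    """ontology: {label: parent_label or None}. Return the full path
    (root → leaf) for every label whose own name, or any ancestor's
    name, contains the query (case-insensitive)."""
    def path(label):
        steps = []
        while label is not None:
            steps.append(label)
            label = ontology[label]
        return list(reversed(steps))
    q = query.lower()
    return [p for p in (path(label) for label in ontology)
            if any(q in step.lower() for step in p)]
```

Showing the full path in the results also helps annotators decide when falling back to a parent label is appropriate.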

For More Information…

Many more details about what we did, why we did it, and what happened are available in our project-internal docs and instructions! We’d be happy to share them with anyone interested. The raw data is not suitable for publication, but again, we can share the files if you are interested. Contact jbernd[chez]icsi[stop]berkeley[stop]edu.

Credits and Acknowledgements

PIs: Gerald Friedland, Julia Bernd

Advisers: Jaeyoung Choi, Benjamin Elizalde

Undergraduate Research Assistants and Annotators: Sofea Dil, Yifan Li, Daniel Ma, Abhishek Sharma, Meg Tanaka

Additional Acknowledgments: Thanks are due to the many audio researchers who provided input and suggestions for structuring the dataset. Thanks as well to the Audio Set team at Google Research and to the SONYC lab at New York University, who were simultaneously investigating other approaches to audio annotation; we learned much from their results.

Funding: The AudioNet pilot project was supported by a planning grant from the National Science Foundation, CNS-1629990. (Findings and opinions in this report are the authors’, and do not necessarily reflect the views of the NSF.)