Supervised Detection of Strongly-Labelled Whale Calls


Task 2 Description

Coordinators

Dorian Cazau

Olivier Adam
Sorbonne Université, LAM

Paul Carvaillo
France Energies Marines

Gabriel Dubus
Sorbonne Université, LAM

Anatole Gros-Martial
Centre d’Etudes Biologiques de Chizé, GEO-Ocean

Lucie Jean-Labadye
Sorbonne Université, LAM

Axel Marmoret
IMT Atlantique

Brian Miller
Australian Antarctic Division

Ilyass Moummad
INRIA

Andrea Napoli

Paul Nguyen Hong Duc
Curtin University

Clea Parcerisas
VLIZ

Marie Roch
San Diego State University, Marine Bioacoustics Research Collaborative

Pierre-Yves le Rolland Raumer
IUEM

Elena Schall
AWI

Paul White

Ellen White

Passive Acoustic Monitoring (PAM) is a technology used to listen to and analyze sounds in the ocean, and has emerged as a transformative tool for applied ecology, conservation and biodiversity monitoring. In particular, it offers unique opportunities to examine long-term trends in the population growth, abundance and distribution of different whale species. But to efficiently automate the processing of PAM data for this application, two major challenges need to be addressed: the scarcity of whale calls on the one hand, and the variability of the surrounding acoustic environment on the other.

In this context, a supervised sound event detection task was designed and applied to the detection of 7 different call types from two emblematic whale species, the Antarctic blue and fin whales. This task aims to improve and assess the ability of models to address the two challenges just mentioned: first, because calls are present only 6 % of the time, and second, because the recordings come from different sites and time periods. In a nutshell, Antarctica appeared to us as a very exciting playground for starting a large-scale evaluation of model generalization capacity!

Scientific context

Antarctic blue (Balaenoptera musculus intermedia) and fin (Balaenoptera physalus quoyi) whales were nearly wiped out during industrial whaling. For the past twenty-five years, long-term passive acoustic monitoring has provided one of the few cost-effective means of studying them on their remote feeding grounds at high latitudes around the Antarctic continent.

Long-term acoustic monitoring efforts have been conducted by several nations in the Antarctic, and in recent years this work has been coordinated internationally via the Acoustic Trends Working Group of the Southern Ocean Research Partnership of the International Whaling Commission (IWC-SORP). Some of the overarching goals of the Acoustic Trends Project include “using acoustics to examine trends in Antarctic blue and fin whale population growth, abundance, distribution, seasonal movements, and behaviour” [1].

Within the IWC-SORP Acoustic Trends Project, relevant ecological metrics include the presence of acoustic calls over time scales ranging from minutes to months. Furthermore, recent work has highlighted the additional value that can be derived from estimates of the number of calls per time period [2].

In 2020, the Acoustic Trends Project released publicly one of the largest annotated datasets for marine bioacoustics, the so-called AcousticTrends_BlueFinLibrary (ATBFL) [3]. Release of this annotated library was intended to help standardise analysis and compare the performance of different detectors across the range of locations, years, and instruments used to monitor these species. It has already been exploited in several benchmarking research papers [3][4][5].

Task definition

The task is a classical supervised multi-class and multi-label sound event detection task using strong labels, i.e. labels that specify the start and end times of the events. Systems must provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see Fig. 1). In the context of the IWC-SORP Acoustic Trends Project described above, the task is applied to the detection of 7 different call types from Antarctic blue and fin whales, grouped into 3 categories for evaluation. It aims to challenge and assess the generalization ability of models to adapt and perform in varying acoustic environments, reflecting the real-world variability encountered in marine mammal monitoring.

Figure 1: Overview of the sound event detection task. Blue whale (Bm) D and Z calls are present, as well as fin whale (Bp) 20 Hz pulses with (Bp20p) or without (Bp20) overtone and the 40 Hz downsweep (BpD).


Call description

The 7 call types to detect are as follows:

  • Antarctic blue whale Z-call (label: bmz): single unit with a smooth transition from 27 to 16 Hz, composed of three parts A, B and C
  • Antarctic blue whale A-call (label: bma): Z-call containing only the A part
  • Antarctic blue whale B-call (label: bmb): Z-call containing only the B part
  • Antarctic blue whale D-call (label: bmd): downsweeping frequency component between 20 and 120 Hz, which can also include further frequency modulation (e.g., an upsweep at the start)
  • Fin whale 20 Hz pulse without overtone (label: bp20): downsweeping frequency component from 30 to 15 Hz
  • Fin whale 20 Hz pulse with overtone (label: bp20plus): 20 Hz pulse with secondary energy at varying frequencies (80 to 120 Hz)
  • Fin whale 40 Hz downsweep (label: bpd): downsweeping vocalisation ending around 40 Hz, usually below 90 Hz and above 30 Hz

See more detailed information and examples of call spectrograms here.
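
Since all of these calls sit in narrow low-frequency bands, it can be handy to keep their approximate frequency ranges in code, for instance to set spectrogram display limits or band-pass filters. The values below are simply transcribed from the descriptions above (the bma/bmb entries reuse the full Z-call band); they are indicative, not official band definitions.

```python
# Approximate frequency ranges (Hz) per call label, transcribed from the
# call descriptions above. Indicative values only, not official definitions.
CALL_FREQ_RANGES_HZ = {
    "bmz": (16, 27),        # full three-part Z-call
    "bma": (16, 27),        # A part of the Z-call (upper portion of the band)
    "bmb": (16, 27),        # B part of the Z-call
    "bmd": (20, 120),       # blue whale D-call downsweep
    "bp20": (15, 30),       # fin whale 20 Hz pulse without overtone
    "bp20plus": (15, 120),  # 20 Hz pulse plus overtone energy up to ~120 Hz
    "bpd": (30, 90),        # fin whale 40 Hz downsweep
}
```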

Model development

Train and validation datasets

The overall development dataset is composed of the entire IWC-SORP ATBFL dataset, already introduced in the scientific context. As described in Table 1, it is organized into 11 site-year deployments located all around Antarctica, with recording periods ranging from 2005 to 2017. It contains a total of 6591 audio files, totaling 1880 hours of recordings, sampled at 250 Hz.

The training dataset is composed of all site-year deployments except Kerguelen 2014, Kerguelen 2015 and Casey 2017, which have been left out of training to form the validation dataset. This gives a total of 6004 audio files for the training dataset over 8 site-year deployments, and 587 audio files for the validation dataset over 3 site-year deployments.

| Deployment | Number of audio recordings | Total duration (h) | Total events | Ratio event/duration (%) |
|---|---|---|---|---|
| ballenyisland2015 | 205 | 204 | 2222 | 1.4 |
| casey2014 | 194 | 194 | 6866 | 7.3 |
| elephantisland2013 | 2247 | 187 | 21223 | 8.6 |
| elephantisland2014 | 2595 | 216 | 20964 | 13 |
| greenwich2015 | 190 | 31.7 | 1128 | 6.5 |
| kerguelen2005 | 200 | 200 | 2960 | 1.8 |
| maudrise2014 | 200 | 83.3 | 2360 | 6.9 |
| rosssea2014 | 176 | 176 | 104 | 5 |
| TOTAL TRAIN | 6007 | 1292 | 57827 | 5.1 |
| casey2017 | 187 | 185 | 3263 | 3.3 |
| kerguelen2014 | 200 | 200 | 8822 | 5.7 |
| kerguelen2015 | 200 | 200 | 5542 | 3.7 |
| TOTAL VALIDATION | 587 | 585 | 17627 | 5.1 |

Table 1. Summary statistics on the development dataset.

A more complete version of this table is available here, with more statistics within the different classes and more information on the recording deployments.
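
Since the train/validation split is defined at the deployment level, it can be reproduced by hard-coding the three validation deployments listed in Table 1 and routing files accordingly. A minimal sketch, assuming the audio is stored in one sub-folder per deployment (check the Zenodo entry for the actual folder layout):

```python
from pathlib import Path

# Deployments held out for validation, as listed in Table 1.
VALIDATION_DEPLOYMENTS = {"casey2017", "kerguelen2014", "kerguelen2015"}

def split_by_deployment(root: str):
    """Return (train_files, val_files), assuming <root>/<deployment>/<file>.wav."""
    train, val = [], []
    for wav in sorted(Path(root).rglob("*.wav")):
        deployment = wav.parent.name.lower()
        (val if deployment in VALIDATION_DEPLOYMENTS else train).append(wav)
    return train, val
```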

Annotation

Description

The annotation data of the development dataset corresponds to the annotations published with the IWC-SORP ATBFL dataset, where each site-year deployment comes with its own annotation file. The 11 CSV annotation files are named after their corresponding site-year deployments. Each annotated sound event is defined by the tuple (label, low_frequency, high_frequency, start_datetime, end_datetime), with label taking a unique value in {bma, bmb, bmz, bmd, bpd, bp20, bp20plus}.
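
For illustration, one of these annotation files can be loaded with pandas roughly as follows; the column names follow the tuple above, so adjust them if the released files use slightly different headers:

```python
import pandas as pd

LABELS = {"bma", "bmb", "bmz", "bmd", "bpd", "bp20", "bp20plus"}

def load_annotations(csv_path: str) -> pd.DataFrame:
    """Load one site-year annotation file and add an event duration column."""
    df = pd.read_csv(csv_path, parse_dates=["start_datetime", "end_datetime"])
    df = df[df["label"].isin(LABELS)].copy()   # guard against unexpected labels
    df["duration_s"] = (df["end_datetime"] - df["start_datetime"]).dt.total_seconds()
    return df
```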

Protocol and feedback

The development dataset was annotated by a group of expert bioacousticians, with one expert per site-year deployment, following the SORP annotation guidelines. Despite these precautions, and as with most annotated bioacoustics datasets, the annotation sets may still contain defects that any model developer using this corpus should be aware of. In particular, even with a common protocol, ensuring sufficient consistency between the different annotators, and thus between the different site-year annotation sets, remains particularly difficult for the following reasons, which would have required more standardized procedures:

  • different annotation styles and practices: experts have different thresholds for when they mark a call, as well as different styles for marking start and end times and low and high frequencies. Some analysts are precise and some are fast, which affects the overall accuracy of the bounds; few are both;
  • fragmentation, multipath and splitting vs. lumping: long tonal calls can be fragmented by propagation and multipath. Some analysts are splitters and annotate every fragment independently, while others are lumpers and mark a single long call covering all fragments and multipaths;
  • multipath: expert analysts are often unsure whether a potential multipath arrival is an echo or a separate calling animal. The broader context and sequence of calls is likely to be helpful here.

All of these issues are of course exacerbated during time periods with low SNR. For example, the Elephant Island 2015 site-year deployment has already been recognized as more difficult to process for this reason.

Having said that, producing high-quality annotations for such a large-scale dataset is widely recognized as a complex and cumbersome process, both in terms of human resources and scientific expertise. As acknowledged in related audio processing fields [6], these potential defects in the annotations of the development set should be seen as an intrinsic component of this data challenge, reflecting real-life annotation practice, and should be fully addressed by the models.

Download

Raw audio recordings of the development dataset, along with the annotation data, can be downloaded from this Zenodo entry. Note that minor changes were made to the original ATBFL dataset, such as adopting more consistent naming conventions and pooling all per-call-type annotation files into a single file per dataset (see the complete list of minor changes on Zenodo).

Supplementary resources

  • All datasets, annotations and pre-trained models from this list are allowed;
  • The use of other external data (e.g. audio files, annotations) and pre-trained models is allowed only after approval from the task coordinators (contact: dorian.cazau@ensta.fr). Such external data and models must be public, open datasets.

Model evaluation

Dataset

The evaluation dataset is composed of two new site-year deployments, not yet published as part of the ATBFL dataset. They contain the same cetacean species as the development dataset, but from different sites in Antarctica and/or different time periods. These deployments will be used as independent evaluation sets to provide more detailed insights into the generalization performance of models, and an overall evaluation score will also be computed to produce the global ranking of models.

Annotation

Within the IWC-SORP ATBFL project, the same annotation setup as for the development dataset was used for the evaluation dataset. In addition, to ensure the highest quality of evaluation annotations, a two-phase multi-annotator campaign was specially designed for this challenge, including a complete re-annotation of all evaluation data plus a double-checking procedure for the most conflicting cases. The complete protocol and associated results will be released at the end of the data challenge.

Metrics

The evaluation metric is a 1D IoU (Intersection over Union): the time during which a predicted event overlaps a ground-truth event, divided by the total time spanning from the minimum start time to the maximum end time of the two. To emphasize the importance of estimating an accurate number of calls (for example, for a downstream task of population density estimation), this metric has been customized to penalize model outputs in which several detections overlap a single ground truth. For example, if 3 predicted sound events overlap one ground-truth event, only one of them is marked as a true positive (TP) and assigned as correct, while the others are marked as false positives (FP). TPs are then computed by counting all prediction events that have been marked as correct, FPs are all prediction events that were not assigned to a ground truth, and FNs are all ground-truth events that were not assigned any prediction. Recall, precision and F1-score are then computed per class and per deployment.
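
To make the matching rule concrete, below is a minimal sketch of the 1D IoU and of a greedy per-class assignment in which each ground-truth event can absorb at most one prediction, so that extra overlapping detections count as false positives. The IoU threshold and the matching order are assumptions; the reference implementation is the one linked on Github below.

```python
def iou_1d(pred, gt):
    """1D IoU between two (start, end) intervals, e.g. expressed in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(preds, gts, iou_threshold=0.1):
    """Greedy matching for one class and one deployment: each ground truth
    accepts at most one prediction; surplus detections become false positives."""
    matched, tp = set(), 0
    for p in preds:
        best_iou, best_idx = 0.0, None
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = iou_1d(p, g)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(preds) - tp           # predictions not assigned to any ground truth
    fn = len(gts) - len(matched)   # ground truths left without a prediction
    return tp, fp, fn

# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * Precision * Recall / (Precision + Recall), per class and per deployment.
```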

See more details on Github.

Download

This dataset will be released on 1 June 2025.

Rules and submission

Participants are free to employ any preprocessing technique and network architecture. The only requirement is that the final output of your model MUST be a CSV file following the annotation format of the evaluation set described above.

Official challenge submission consists of:

  • System output file (*.csv)
  • Metadata file (*.yaml)
  • Technical report explaining in sufficient detail the method (*.pdf)

System output should be presented as a single text file, in CSV format, with a header row, as shown in the example output below:

dataset,filename,annotation,start_datetime,end_datetime
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:32:03.876700+00:00,2014-02-18T21:32:13.281600+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:37:42.187800+00:00,2014-02-18T21:37:51.400800+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmb,2014-02-18T21:39:06.640300+00:00,2014-02-18T21:39:15.277500+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmz,2014-02-18T21:48:19.270900+00:00,2014-02-18T21:48:28.292000+00:00
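
For reference, such a file can be produced with pandas along the lines below; this is only a sketch, the important points being the exact column order and timezone-aware ISO 8601 timestamps as in the example above.

```python
import pandas as pd

COLUMNS = ["dataset", "filename", "annotation", "start_datetime", "end_datetime"]

def write_submission(detections, out_path="submission.csv"):
    """detections: iterable of dicts with the keys listed in COLUMNS,
    where the datetimes are timezone-aware pandas Timestamps."""
    df = pd.DataFrame(list(detections), columns=COLUMNS)
    for col in ("start_datetime", "end_datetime"):
        # isoformat() keeps the "+00:00" UTC offset shown in the example above
        df[col] = df[col].map(lambda t: pd.Timestamp(t).isoformat())
    df.to_csv(out_path, index=False)
```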

Example meta information file baseline system Parcerisas_VLIZ_task2_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions, to avoid overlapping codes among submissions
  # use the following way to form your label:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Parcerisas_VLIZ_task2_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: yolo baseline

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight, maximum 10 characters
  abbreviation: yolo_base

  # Submission authors in order, mark one of the authors as corresponding author.
  authors:
    # First author
    - lastname: Parcerisas
      firstname: Clea
      email: clea.parcerisas@vliz.be                    # Contact email address
      corresponding: true                             # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: VLIZ
        institute: Flanders Marine Institute
        department: Marine Observation Centre
        location: Ostend, BE

        #... More authors can be specified by repeating the information


# System information
system:
  # SED system description, meta data provided here will be used to do
  # meta analysis of the submitted system. Use general level tags, if possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input
    input_sampling_rate: 250               # In Hz (specify "any" if you take it into account)

    # Acoustic representation
    acoustic_features: spectrogram   # e.g one or multiple [MFCC, log-mel energies, spectrogram, CQT, PCEN, ...]
    
    # Noise percentage
    noise_percentage: balanced   # specify "balanced" when taking the same amount of label samples. Otherwise specify percentage of the total noise as float from 0 to 1

    # Data augmentation methods
    data_augmentation: !!null             # [time stretching, block mixing, pitch shifting, ...] Specify !!null if none

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    machine_learning_method: YOLO         # e.g one or multiple [GMM, HMM, SVM, kNN, MLP, CNN, RNN, CRNN, NMF, random forest, ensemble, transformer, ...]

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: !!null

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    ensemble_method_subsystem_count: !!null # [2, 3, 4, 5, ... ]

    # Decision making methods (for ensemble)
    decision_making: !!null                 # [majority vote, ...]

    # Post-processing, followed by the time span (in ms) in case of smoothing
    post-processing: YOLO default				# [median filtering, time aggregation...]

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:

    # Total amount of parameters used in the acoustic model. For neural networks, this
    # information is usually given before training process in the network summary.
    # For other than neural networks, if parameter count information is not directly available,
    # try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    total_parameters: 9,428,953    # note that for simple template matching, the "parameters"==the pixel count of the templates, plus 1 for each param such as thresholding. 
    # Approximate training time followed by the hardware used
    training_time: 10h 40min
    # Model size in MB
    model_size: 22


  # URL to the source code of the system [optional, highly recommended]
  source_code: https://github.com/marinebioCASE/task2_2025/tree/main/baselines/yolo 

  # List of external datasets used in the submission.
  # A previous DCASE development dataset is used here only as example! List only external datasets
  external_datasets:
    # Dataset name
    - name: !!null
      # Dataset access url
      url: !!null
      # Total audio length in minutes
      total_audio_length: !!null            # minutes

# System results 
results:
  # Full results are not mandatory, but they are recommended for a thorough analysis of the challenge submissions.
  # If you cannot provide all result details, also incomplete results can be reported.
  validation_set:
    overall:
      F1: 0.43 # 0 to 1
      precision: 0.67
      recall: 0.32 

    # Per-dataset
    dataset_wise:
      Casey2017: 
        F1: 0.55 # 0 to 1
        precision: 0.7
        recall: 0.46
      Kerguelen2014:
        F1: 0.38 # 0 to 1
        precision: 0.66
        recall: 0.29
      Kerguelen2015:
        F1: 0.40 # 0 to 1
        precision: 0.67
        recall: 0.32


You can directly download those files here.

Baseline

An off-the-shelf object detection model, YOLOv11, was run to obtain baseline performance on the validation set (see Table 2). In addition to providing reference performance for the task, this baseline also serves as a getting-started code example, where you will find general routines to load audio and annotation files in different Python frameworks, compute spectrograms, train baseline models and run the evaluation code on the validation set.

| Model | Recall | Precision | F1 |
|---|---|---|---|
| YOLOv11 | 0.32 | 0.67 | 0.43 |

Table 2. Baseline performance on the validation set.

See more detailed results per class and per dataset here.

See baseline details and download the code on Github.
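
For a quick look at the data before diving into the baseline repository, a spectrogram of one 250 Hz recording can be computed with standard Python tools; this is a minimal sketch with arbitrary window settings, not the baseline's own preprocessing.

```python
import numpy as np
import soundfile as sf
from scipy.signal import spectrogram

def compute_log_spectrogram(wav_path, nperseg=256, noverlap=192):
    """Load one recording (expected sample rate: 250 Hz) and return
    frequencies (Hz), times (s) and the log-magnitude spectrogram (dB)."""
    audio, sr = sf.read(wav_path)
    freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return freqs, times, 10 * np.log10(sxx + 1e-12)  # small offset avoids log(0)
```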

Support

If you have questions please use the BioDCASE Google Groups community forum, or directly contact task coordinators (dorian.cazau@ensta.fr).

References

[1] Miller et al. (2021). An open access dataset for developing automated detectors of Antarctic baleen whale sounds and performance evaluation of two commonly used detectors. Sci. Rep., 11, 806. doi:10.1038/s41598-020-78995-8
[2] Castro et al. (2024). Beyond counting calls: estimating detection probability for Antarctic blue whales reveals biological trends in seasonal calling. Front. Mar. Sci. doi:10.3389/fmars.2024.1406678
[3] Miller et al. (2020). An annotated library of underwater acoustic recordings for testing and training automated algorithms for detecting Antarctic blue and fin whale sounds. doi:10.26179/5e6056035c01b
[4] Schall et al. (2024). Deep learning in marine bioacoustics: a benchmark for baleen whale detection. Remote Sens. Ecol. Conserv., 10: 642-654. doi:10.1002/rse2.392
[5] Dubus et al. (2024). Improving automatic detection with supervised contrastive learning: application with low-frequency vocalizations. Workshop DCLDE 2024.
[6] Fonseca et al. (2022). FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM TASLP, vol. 30 (1).