Supervised Detection of Strongly-Labelled Whale Calls


Task 2 Description

Coordinators

Dorian Cazau
Olivier Adam (Sorbonne Université, LAM)
Paul Carvaillo (France Energies Marines)
Gabriel Dubus (Sorbonne Université, LAM)
Anatole Gros-Martial (Centre d’Etudes Biologiques de Chizé, GEO-Ocean)
Lucie Jean-Labadye (Sorbonne Université, LAM)
Axel Marmoret (IMT Atlantique)
Brian Miller (Australian Antarctic Division)
Ilyass Moummad (INRIA)
Andrea Napoli
Paul Nguyen Hong Duc (Curtin University)
Clea Parcerisas
Marie Roch (San Diego State University, Marine Bioacoustics Research Collaborative)
Pierre-Yves le Rolland Raumer (IUEM)
Elena Schall (AWI)
Paul White
Ellen White

Passive Acoustic Monitoring (PAM) is a technology used to analyze sounds in the ocean, where visual observation is highly limited. It has emerged as a transformative tool for applied ecology, conservation and biodiversity monitoring. In particular, it offers unique opportunities to examine long-term trends in the population dynamics, abundance, distribution and behaviour of different whale species. For this purpose, however, the automation of PAM data processing, i.e. the automatic detection of whale calls in long-term recordings, faces two major issues: the scarcity of calls and the variability of soundscapes.

In this data challenge, a supervised sound event detection task was designed and applied to the detection of 7 different call types from two emblematic whale species, the Antarctic blue and fin whales. The task aims to improve and assess the ability of models to address the two issues just mentioned: whale calls are present only about 6 % of the time, and the PAM recordings come from different time periods and sites all around Antarctica, with highly variable soundscapes. The White Continent is a very exciting playground for a large-scale evaluation of model generalization capacity, but a challenging one for sure!

Scientific context

Antarctic blue (Balaenoptera musculus intermedia) and fin (Balaenoptera physalus quoyi) whales were nearly wiped out during industrial whaling. For the past twenty-five years, long-term passive acoustic monitoring has provided one of the few cost-effective means of studying them on their remote feeding grounds at high latitudes around the Antarctic continent.

Long term acoustic monitoring efforts have been conducted by several nations in the Antarctic, and in recent years this work has been coordinated internationally via the Acoustic Trends Working Group within the Southern Ocean Research Partnership of the International Whaling Commission (IWC-SORP). Some of the overarching goals of the Acoustic Trends Project include “using acoustics to examine trends in Antarctic blue and fin whale population growth, abundance, distribution, seasonal movements, and behaviour” [1], implementing ecological metrics on the presence of calls per time-period (ranging from minutes to months), and more recently on the number of calls per time-period [2].

In 2020, the Acoustic Trends Project released publicly one of the largest annotated datasets for marine bioacoustics, the so-called AcousticTrends_BlueFinLibrary (ATBFL) [3]. Release of this annotated library was intended to help standardise analysis and compare the performance of different detectors across the range of locations, years, and instruments used to monitor these species. It has already been exploited in several benchmarking research papers [3][4][5].

Task definition

The task is a classical supervised multi-class, multi-label sound event detection task using strong labels. Strong labels specify the start and end time of each event, so models must output not only the event class but also its time localization, given that multiple events can be present in an audio recording (see Fig. 1). In the context of the IWC-SORP Acoustic Trends Project described above, the task is applied to the detection of 7 different call types from Antarctic blue and fin whales, grouped into 3 categories for evaluation. It aims to challenge and assess the generalization ability of models, which must adapt and perform in varying acoustic environments while detecting sound events with a low presence rate (below 10 %), reflecting the real-world difficulties encountered in marine mammal monitoring.

Figure 1. Overview of the sound event detection task. Blue whale D- and Z-calls (BmD and BmZ) are present, as well as fin whale 20 Hz pulses with (Bp20plus) or without (Bp20) overtone and the fin whale 40 Hz downsweep (BpD).
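To make the strong-label setting concrete, the sketch below rasterizes a list of (label, start, end) annotations into a frame-level multi-label target matrix of the kind typically used to train SED models. This is a minimal illustration, not the challenge's required representation: the 0.2 s hop, the class ordering and the 1 h clip duration are assumptions.

import numpy as np

# Label set in the machine-readable form used by the annotation files
CLASSES = ["bma", "bmb", "bmz", "bmd", "bpd", "bp20", "bp20plus"]

def strong_labels_to_frames(events, clip_duration_s, hop_s=0.2):
    """Rasterize strong labels into a (n_frames, n_classes) multi-label target.

    events: iterable of (label, start_s, end_s), times in seconds from the clip start.
    hop_s:  frame hop in seconds (an assumed value, match it to your feature extractor).
    """
    n_frames = int(np.ceil(clip_duration_s / hop_s))
    target = np.zeros((n_frames, len(CLASSES)), dtype=np.float32)
    for label, start_s, end_s in events:
        first = int(np.floor(start_s / hop_s))
        last = min(n_frames, int(np.ceil(end_s / hop_s)))
        target[first:last, CLASSES.index(label)] = 1.0   # overlapping calls activate several columns
    return target

# Example: an overlapping blue whale Z-call and fin whale 20 Hz pulse in a 1 h recording
target = strong_labels_to_frames([("bmz", 12.3, 26.8), ("bp20", 20.0, 21.1)], clip_duration_s=3600)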


Call description

The 7 call types to detect are as follows:

  • Antarctic blue whale Z-call (annotation label: BmZ): smooth transition of a single unit from 27 to 16 Hz, composed of three parts A, B and C
  • Antarctic blue whale A-call (annotation label: BmA): Z-call containing only the A part
  • Antarctic blue whale B-call (annotation label: BmB): Z-call containing the A and B parts
  • Antarctic blue whale D-call (annotation label: BmD): downsweeping frequency component between 20 and 120 Hz, which can also include further frequency modulations (e.g., an upsweep at the start)
  • Fin whale 20 Hz pulse without overtone (annotation label: Bp20): downsweeping frequency component from 30 to 15 Hz
  • Fin whale 20 Hz pulse with overtone (annotation label: Bp20plus): 20 Hz pulse with secondary energy at varying frequencies (between 80 and 120 Hz)
  • Fin whale 40 Hz downsweep (annotation label: BpD): downsweeping vocalisation ending around 40 Hz, usually below 90 Hz and above 30 Hz

See more detailed information and examples of call spectrograms here.
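For band-limited pre-processing (for example, cropping spectrograms around each call type), the approximate frequency ranges quoted above can be gathered into a small lookup table. This is only an indicative sketch in Python: the bounds are read off the descriptions in this section, and since no separate bounds are given above for BmA and BmB, they reuse the Z-call band.

# Indicative frequency bands (Hz) per annotation label, read off the call descriptions above.
# These are rough bounds for band-limited pre-processing, not official annotation limits.
CALL_BANDS_HZ = {
    "bmz": (16.0, 27.0),        # Z-call sweeps from 27 down to 16 Hz
    "bma": (16.0, 27.0),        # A part of the Z-call (no separate bounds given above)
    "bmb": (16.0, 27.0),        # A and B parts of the Z-call (no separate bounds given above)
    "bmd": (20.0, 120.0),       # D-call downsweep between 20 and 120 Hz
    "bp20": (15.0, 30.0),       # 20 Hz pulse, downsweep from 30 to 15 Hz
    "bp20plus": (15.0, 120.0),  # 20 Hz pulse plus overtone energy between 80 and 120 Hz
    "bpd": (30.0, 90.0),        # 40 Hz downsweep, usually above 30 Hz and below 90 Hz
}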

Model development

Train and validation sets

The overall development set relies on the entire ATBFL library, already introduced in the scientific context. As described in Table 1, it is organized into 11 site-year datasets, corresponding to deployments located all around Antarctica, with recording periods ranging from 2005 to 2017. It contains a total of 6591 audio files, totalling 1880 hours of recording sampled at 250 Hz.

The training set is composed of all site-year datasets with the exception of Kerguelen 2014, Kerguelen 2015 and Casey 2017, which have been held out to form the validation set. This gives a total of 6007 audio files across 8 site-year datasets for the training set, and 587 audio files across 3 site-year datasets for the validation set.

Dataset              Number of audio recordings   Total duration (h)   Total events   Ratio event/duration (%)
ballenyisland2015                           205                  204           2222                        1.4
casey2014                                   194                  194           6866                        7.3
elephantisland2013                         2247                  187          21223                        8.6
elephantisland2014                         2595                  216          20964                       13
greenwich2015                               190                 31.7           1128                        6.5
kerguelen2005                               200                  200           2960                        1.8
maudrise2014                                200                 83.3           2360                        6.9
rosssea2014                                 176                  176            104                        5
TOTAL TRAIN                                6007                 1292          57827                        5.1
casey2017                                   187                  185           3263                        3.3
kerguelen2014                               200                  200           8822                        5.7
kerguelen2015                               200                  200           5542                        3.7
TOTAL VALIDATION                            587                  585          17627                        5.1
Table 1. Summary statistics on the development set.

A more complete version of this table is available here, with more statistics within the different classes and more information on the recording deployments.

See details about the dataset tree structure on the Zenodo page.

Please check this Erratum page highlighting a few inconsistencies in the annotations of the corpus.
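The site-year split described above is straightforward to encode in code. Below is a minimal sketch assuming one annotation CSV per site-year dataset, as distributed on Zenodo; the folder layout and glob pattern are assumptions to adapt to the actual dataset tree.

from pathlib import Path
import pandas as pd

# Site-year datasets held out for validation (see Table 1)
VALIDATION_DATASETS = {"casey2017", "kerguelen2014", "kerguelen2015"}

def load_annotations(annotation_dir):
    """Concatenate per-dataset annotation CSVs and tag each event as train or validation.

    annotation_dir: folder holding one annotation CSV per site-year dataset
                    (hypothetical layout; adapt the glob to the actual Zenodo tree).
    """
    frames = []
    for csv_path in Path(annotation_dir).glob("*.csv"):
        df = pd.read_csv(csv_path, parse_dates=["start_datetime", "end_datetime"])
        df["split"] = "train"
        df.loc[df["dataset"].isin(VALIDATION_DATASETS), "split"] = "validation"
        frames.append(df)
    return pd.concat(frames, ignore_index=True)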

Annotation

Description

The annotation data of the development set correspond to those published with the ATBFL library, where each site-year dataset comes with its own annotation file. Each annotated sound event is defined by the tuple (dataset,filename,annotation,annotator,low_frequency,high_frequency,start_datetime,end_datetime), with annotation representing the class label and taking a unique name in {bma, bmb, bmz, bmd, bpd, bp20, bp20plus}. Note that these labels are a more machine-readable version of the list {BmA, BmB, BmZ, BmD, BpD, Bp20, Bp20plus} described above in Call description. Annotator is the short name of the expert annotator who produced the annotation file. There is a single annotator per dataset, but the same annotator may have annotated several datasets (for example, balcazar annotated both casey2014 and kerguelen2015).

Here is an example annotation file:

dataset,filename,annotation,annotator,low_frequency,high_frequency,start_datetime,end_datetime
ballenyislands2015,2015-02-04T03-00-00_000.wav,bma,nieukirk,21.9,28.4,2015-02-04T03:27:32.053000+00:00,2015-02-04T03:27:43.709000+00:00
ballenyislands2015,2015-02-16T11-00-00_000.wav,bma,nieukirk,22.4,27.3,2015-02-16T11:54:10.436000+00:00,2015-02-16T11:54:16.414000+00:00
ballenyislands2015,2015-02-16T11-00-00_000.wav,bma,nieukirk,24.5,27.3,2015-02-16T11:52:24.753000+00:00,2015-02-16T11:52:29.236000+00:00
ballenyislands2015,2015-02-16T11-00-00_000.wav,bma,nieukirk,24.2,27.6,2015-02-16T11:50:47.251000+00:00,2015-02-16T11:50:53.004000+00:00
ballenyislands2015,2015-02-20T00-00-00_000.wav,bma,nieukirk,22.7,27.3,2015-02-20T00:38:36.379000+00:00,2015-02-20T00:38:44.225000+00:00
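Since the annotation files are plain CSV with ISO-8601 datetimes, they can be inspected directly with pandas. A minimal sketch (the file name below is hypothetical; use the actual annotation file paths from the Zenodo archive):

import pandas as pd

# Load one site-year annotation file (hypothetical file name)
ann = pd.read_csv(
    "ballenyislands2015_annotations.csv",
    parse_dates=["start_datetime", "end_datetime"],
)

# Event durations in seconds and per-class summary statistics, similar to Table 1
ann["duration_s"] = (ann["end_datetime"] - ann["start_datetime"]).dt.total_seconds()
print(ann.groupby("annotation")["duration_s"].agg(["count", "mean", "sum"]))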

See more details on the Zenodo page.

Protocol and feedback

The development set was annotated by a group of expert bioacousticians, with one expert per site-year dataset, following the SORP annotation guideline. Despite these precautions, and as in most annotated bioacoustics datasets, the annotation data may still contain defects that any model developer using this corpus should be aware of. In particular, even with a common protocol, ensuring consistency between the different annotators, and thus between the different site-year annotation data, is difficult for the following reasons, which would have required more standardized procedures:

  • different annotation styles and practices: experts have different thresholds for when they mark a call, as well as different styles for marking start and end times and high and low frequencies. Some analysts are precise and some are fast, which impacts the overall accuracy of the bounds; few are both;
  • fragmentation, multipath, and splitting vs lumping: long tonal calls can be fragmented due to propagation and multipath. Some analysts are splitters and annotate every fragment independently, while others are lumpers and mark a single long call covering all fragments and multipaths;
  • multipath: expert analysts are often unsure whether a potential multipath is an echo or a separate animal calling. The broader context and sequence of calls is likely to be helpful here.

All of these issues are of course made worse during periods of low SNR. From past research using the ATBFL library [4,7], we can also provide the following feedback specific to some site-year datasets:

  • for Elephant Island and Balleny Island, missed annotations have been observed and might be a problem for model training;
  • for Casey and Maud Rise, the strong chorus band caused by both fin whale 20 Hz pulses and blue whale Z-calls makes it harder to detect single calls within that band; sometimes higher-frequency components (such as the Bp20plus annotations) are the only way to distinguish single calls from the underlying chorus;
  • see example spectrograms for these cases in [7].

Having said that, it is widely recognized that producing high-quality annotations for such a large-scale dataset is a complex and cumbersome process, both in terms of human resources and scientific expertise. As recognized in related audio processing fields [6], these potential defects in the annotations of the development set should be seen as an intrinsic component of this data challenge, reflecting real-life annotation practices, and as such they should be handled by the models.

Download

Raw audio recordings of the 11 site-year datasets of the development set, along with their annotation data, can be downloaded from this Zenodo repository. Note that minor changes were made to the original ATBFL library, such as adopting more consistent naming conventions and pooling all annotation files per call type into a single file per dataset (see the complete list of minor changes on Zenodo).

Supplementary resources

  • All datasets, annotations and pre-trained models from this list are allowed;
  • Use of other external data (e.g. audio files, annotations) and pre-trained models is allowed only after approval from the task coordinators (contact: dorian.cazau@ensta.fr). These external data and models must be public, open datasets.

Model evaluation

Evaluation set

The evaluation set is composed of two new site-year datasets, not yet published as part of the ATBFL library. They contain the same whale species as the development set but come from different sites in Antarctica and/or different time periods. These two datasets will be scored independently to give more detailed insights into the generalization performance of models, and an overall score will also be computed to produce the global ranking of models.

Annotation

The evaluation set was annotated within the IWC-SORP Acoustic Trends Project using the same annotation setup as the development set. In addition, to ensure the highest quality of evaluation annotations, a two-phase multi-annotator campaign was designed specifically for this challenge, including a complete re-annotation of all evaluation data plus a double-checking procedure for the most conflicting cases. The complete protocol and associated results will be released at the end of the data challenge.

Metrics

The evaluation metric is a 1D IoU (Intersection over Union) computed over the temporal axis: the total time during which the predicted event overlaps with the ground-truth event, divided by the total time spanning from the minimum start time to the maximum end time. To emphasize the importance of estimating an accurate number of calls (for example, for a downstream task of population density estimation), this metric is customized to penalize model outputs with several detections overlapping a single ground truth. For example, if 3 predicted sound events overlap with one single ground-truth event, only one of the predictions is marked as a True Positive (TP) and assigned as correct, and the rest are marked as False Positives (FP). TPs are then computed by counting all predicted events marked as correct, FPs are all predicted events not assigned to a ground truth, and FNs are all ground-truth events not assigned any prediction. Recall, Precision and F1-score are then computed per class and per dataset.
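The official implementation of this metric is in the challenge Github repository (see below); the sketch here only illustrates the temporal IoU and the one-prediction-per-ground-truth assignment described above. The greedy matching order and the 0.5 IoU threshold are assumptions for illustration, not part of the official definition.

def interval_iou(pred, gt):
    """1D temporal IoU between two (start_s, end_s) intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

def score_events(preds, gts, iou_threshold=0.5):
    """Greedy matching where each ground truth absorbs at most one prediction.

    Extra predictions overlapping an already-matched ground truth count as false positives,
    as described in the metric above. The 0.5 threshold is an assumed value for illustration.
    """
    matched = set()
    tp = 0
    for pred in preds:
        candidates = [
            (interval_iou(pred, gt), j)
            for j, gt in enumerate(gts) if j not in matched
        ]
        best_iou, best_j = max(candidates, default=(0.0, None))
        if best_j is not None and best_iou >= iou_threshold:
            tp += 1
            matched.add(best_j)
    fp = len(preds) - tp
    fn = len(gts) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1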

See more details on Github.

Download

The evaluation set will be released on 1 June 2025.

Rules and submission

Official challenge submission consists of:

  • Model output file (*.csv)
  • Metadata file (*.yaml)
  • Technical report explaining in sufficient details the model (*.pdf)

First, the final model output must be presented as a single text file in CSV format, with a header row, as shown in the example output below:

dataset,filename,annotation,start_datetime,end_datetime
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:32:03.876700+00:00,2014-02-18T21:32:13.281600+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:37:42.187800+00:00,2014-02-18T21:37:51.400800+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmb,2014-02-18T21:39:06.640300+00:00,2014-02-18T21:39:15.277500+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmz,2014-02-18T21:48:19.270900+00:00,2014-02-18T21:48:28.292000+00:00

Note that, compared to the annotation files used during the development phase, the fields related to frequency information and annotators are no longer required.
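Below is a sketch of how per-file detections, expressed as onset/offset seconds from the start of each recording, can be converted into the required CSV. The helper names are hypothetical, and the parsing assumes the recording start time is encoded in the filename as in the examples above (e.g. 2014-02-18T21-00-00_000.wav); adapt it if your pipeline stores times differently.

from datetime import datetime, timedelta, timezone
import pandas as pd

def file_start(filename):
    """Parse the recording start time from names like 2014-02-18T21-00-00_000.wav."""
    stamp = filename.rsplit("_", 1)[0]                    # "2014-02-18T21-00-00"
    return datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S").replace(tzinfo=timezone.utc)

def to_submission(detections, out_csv="submission.csv"):
    """detections: iterable of (dataset, filename, label, onset_s, offset_s) tuples."""
    rows = []
    for dataset, filename, label, onset_s, offset_s in detections:
        t0 = file_start(filename)
        rows.append({
            "dataset": dataset,
            "filename": filename,
            "annotation": label,
            "start_datetime": (t0 + timedelta(seconds=onset_s)).isoformat(),
            "end_datetime": (t0 + timedelta(seconds=offset_s)).isoformat(),
        })
    pd.DataFrame(rows).to_csv(out_csv, index=False)

# Example: one detection 1923.87 s into the recording, lasting about 9.4 s
to_submission([("kerguelen2014", "2014-02-18T21-00-00_000.wav", "bma", 1923.87, 1933.28)])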

Second, a complete YAML metadata file describing the model should be provided. Here is an example Parcerisas_VLIZ_task2_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions, to avoid overlapping codes among submissions
  # use the following way to form your label:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Parcerisas_VLIZ_task2_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: yolo baseline

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight, maximum 10 characters
  abbreviation: yolo_base

  # Submission authors in order, mark one of the authors as corresponding author.
  authors:
    # First author
    - lastname: Parcerisas
      firstname: Clea
      email: clea.parcerisas@vliz.be                    # Contact email address
      corresponding: true                             # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: VLIZ
        institute: Flanders Marine Institute
        department: Marine Observation Centre
        location: Ostend, BE

        #... More authors can be specified by repeating the information


# System information
system:
  # SED system description, meta data provided here will be used to do
  # meta analysis of the submitted system. Use general level tags, if possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input
    input_sampling_rate: 250               # In Hz (use "any" if the model does not depend on the sampling rate)

    # Acoustic representation
    acoustic_features: spectrogram   # e.g one or multiple [MFCC, log-mel energies, spectrogram, CQT, PCEN, ...]
    
    # Noise percentage
    noise_percentage: balanced   # use "balanced" when sampling as many noise segments as labelled samples; otherwise give the fraction of the total noise used, as a float from 0 to 1

    # Data augmentation methods
    data_augmentation: !!null             # [time stretching, block mixing, pitch shifting, ...] Specify !!null if none

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    machine_learning_method: YOLO         # e.g one or multiple [GMM, HMM, SVM, kNN, MLP, CNN, RNN, CRNN, NMF, random forest, ensemble, transformer, ...]

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: !!null

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    ensemble_method_subsystem_count: !!null # [2, 3, 4, 5, ... ]

    # Decision making methods (for ensemble)
    decision_making: !!null                 # [majority vote, ...]

    # Post-processing, followed by the time span (in ms) in case of smoothing
    post-processing: YOLO default				# [median filtering, time aggregation...]

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:

    # Total amount of parameters used in the acoustic model. For neural networks, this
    # information is usually given before training process in the network summary.
    # For other than neural networks, if parameter count information is not directly available,
    # try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    total_parameters: 9428953    # note that for simple template matching, the "parameters" == the pixel count of the templates, plus 1 for each extra parameter such as a threshold.
    # Approximate training time, followed by the hardware used
    training_time: 10h 40min
    # Model size in MB
    model_size: 22


  # URL to the source code of the system [optional, highly recommended]
  source_code: https://github.com/marinebioCASE/task2_2025/tree/main/baselines/yolo 

  # List of external datasets used in the submission.
  # A previous DCASE development dataset is used here only as example! List only external datasets
  external_datasets:
    # Dataset name
    - name: !!null
      # Dataset access url
      url: !!null
      # Total audio length in minutes
      total_audio_length: !!null            # minutes

# System results 
results:
  # Full results are not mandatory, but they are recommended for a thorough analysis of the challenge submissions.
  # If you cannot provide all result details, incomplete results can also be reported.
  validation_set:
    overall:
      F1: 0.43 # 0 to 1
      precision: 0.67
      recall: 0.32 

    # Per-dataset
    dataset_wise:
      Casey2017: 
        F1: 0.55 # 0 to 1
        precision: 0.7
        recall: 0.46
      Kerguelen2014:
        F1: 0.38 # 0 to 1
        precision: 0.66
        recall: 0.29
      Kerguelen2015:
        F1: 0.40 # 0 to 1
        precision: 0.67
        recall: 0.32


You can directly download those files here.
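Before submitting, the metadata file can be sanity-checked in a few lines of Python. A minimal sketch using pyyaml, with key names taken from the example above and purely illustrative checks:

import yaml

with open("Parcerisas_VLIZ_task2_1.meta.yaml") as f:
    meta = yaml.safe_load(f)

# Minimal structural checks against the template shown above
assert "submission" in meta and "system" in meta
assert meta["submission"]["label"].count("_") >= 3, "label should follow Lastname_Institute_taskN_index"
assert any(a.get("corresponding") for a in meta["submission"]["authors"]), "mark one corresponding author"
print("metadata looks structurally OK:", meta["submission"]["label"])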

See more details on the technical report and challenge rules here.

Baselines

Two off-the-shelf models, namely a YOLOv11 and a ResNet18, were run to obtain baseline performance on the validation set (see Table 2). Besides providing reference performance for the task (with a lot of room for improvement), these baselines also serve as getting-started code examples, where you will find general routines to load audio and annotation files in different Python frameworks, compute spectrograms, train baseline models and run the evaluation code on the validation set.

Model      Recall   Precision   F1
YOLOv11      0.32        0.67   0.43
ResNet18     0.36        0.29   0.32
Table 2. Baseline performance on the validation set.

See more detailed results per class and per dataset here.

See baseline details and download codes on Github.
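For getting started outside the baseline repositories, reading one of the 250 Hz recordings and computing a low-frequency spectrogram takes only a few lines. A minimal sketch using soundfile and scipy; the window length, overlap and file name are illustrative, not the baselines' actual settings:

import numpy as np
import soundfile as sf
from scipy.signal import spectrogram

# Read one recording (sampled at 250 Hz, as in the development set)
audio, sr = sf.read("2015-02-04T03-00-00_000.wav")       # example file name from the annotations

# Spectrogram covering the 0-125 Hz band where the target calls live
freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=256, noverlap=192)
log_sxx = 10 * np.log10(sxx + 1e-12)                     # dB scale for visualisation or training
print(log_sxx.shape, f"frequency resolution ~{freqs[1]:.2f} Hz")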

Support

If you have questions, please use the BioDCASE Google Groups community forum, or directly contact the task coordinators (dorian.cazau@ensta.fr).

References

[1] Miller et al. (2021). An open access dataset for developing automated detectors of Antarctic baleen whale sounds and performance evaluation of two commonly used detectors. Sci. Rep., 11, 806. doi:10.1038/s41598-020-78995-8
[2] Castro et al. (2024). Beyond counting calls: estimating detection probability for Antarctic blue whales reveals biological trends in seasonal calling. Front. Mar. Sci. doi:10.3389/fmars.2024.1406678
[3] Miller et al. (2020). An annotated library of underwater acoustic recordings for testing and training automated algorithms for detecting Antarctic blue and fin whale sounds. doi:10.26179/5e6056035c01b
[4] Schall et al. (2024). Deep learning in marine bioacoustics: a benchmark for baleen whale detection. Remote Sens. Ecol. Conserv., 10, 642-654. doi:10.1002/rse2.392
[5] Dubus et al. (2024). Improving automatic detection with supervised contrastive learning: application with low-frequency vocalizations. DCLDE Workshop, 2024.
[6] Fonseca et al. (2022). FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30.
[7] Schall and Parcerisas (2022). A Robust Method to Automatically Detect Fin Whale Acoustic Presence in Large and Diverse Passive Acoustic Datasets. J. Mar. Sci. Eng., 10(12), 1831. doi:10.3390/jmse10121831