Supervised Detection of Strongly-Labelled Whale Calls


Task 2 Description

Coordinators

Dorian Cazau

Olivier Adam
Sorbonne Université, LAM

Paul Carvaillo
France Energies Marines

Gabriel Dubus
Sorbonne Université, LAM

Anatole Gros-Martial
Centre d’Etudes Biologiques de Chizé, GEO-Ocean

Lucie Jean-Labadye
Sorbonne Université, LAM

Axel Marmoret
IMT Atlantique

Brian Miller
Australian Antarctic Division

Ilyass Moummad
INRIA

Andrea Napoli

Paul Nguyen Hong Duc
Curtin University

Clea Parcerisas
VLIZ

Marie Roch
San Diego State University, Marine Bioacoustics Research Collaborative

Pierre-Yves le Rolland Raumer
IUEM

Elena Schall
AWI

Paul White

Ellen White

Passive Acoustic Monitoring (PAM) is a technology used to listen to and analyze sounds in the ocean, and has emerged as a transformative tool for applied ecology, conservation and biodiversity monitoring. In particular, it offers unique opportunities to examine long-term trends in the population growth, abundance and distribution of different whale species. But to efficiently automate the processing of PAM data for this application, two major challenges need to be addressed: the scarcity of whale calls on the one hand, and the variability of the surrounding acoustic environment on the other.

In this context, a supervised sound event detection task was designed and applied to the detection of 7 different call types from two emblematic whale species, the Antarctic blue and fin whales. This task aims to improve and assess the ability of models to address the two challenges just mentioned: first, because calls are present only 6 % of the time, and second, because the recordings come from different sites and time periods. In a nutshell, Antarctica appeared to us as a very exciting playground for starting a large-scale evaluation of model generalization capacity!

Scientific context

Antarctic blue (Balaenoptera musculus intermedia) and fin (Balaenoptera physalus quoyi) whales were nearly wiped out during industrial whaling. For the past twenty-five years, long-term passive acoustic monitoring has provided one of the few cost-effective means of studying them on their remote feeding grounds at high latitudes around the Antarctic continent.

Long-term acoustic monitoring efforts have been conducted by several nations in the Antarctic, and in recent years this work has been coordinated internationally via the Acoustic Trends Working Group of the Southern Ocean Research Partnership of the International Whaling Commission (IWC-SORP). Some of the overarching goals of the Acoustic Trends Project include “using acoustics to examine trends in Antarctic blue and fin whale population growth, abundance, distribution, seasonal movements, and behaviour” [1].

Within the IWC-SORP Acoustic Trends Project, relevant ecological metrics include the presence of acoustic calls over time scales ranging from minutes to months. Furthermore, recent work has highlighted the additional value that can be derived from estimates of the number of calls per time period [2].

In 2020, the Acoustic Trends Project released publicly one of the largest annotated datasets for marine bioacoustics, the so-called AcousticTrends_BlueFinLibrary (ATBFL) [3]. Release of this annotated library was intended to help standardise analysis and compare the performance of different detectors across the range of locations, years, and instruments used to monitor these species. It has already been exploited in several benchmarking research papers [3][4][5].

Task definition

The task is a classical supervised multi-class and multi-label sound event detection task using strong labels, i.e. labels that specify the start and end times of the events. Systems must provide not only the event class but also the event time localization, given that multiple events can be present in an audio recording (see Fig. 1). In the context of the IWC-SORP Acoustic Trends Project described above, the task is applied to the detection of 7 different call types from Antarctic blue and fin whales, grouped into 3 categories for evaluation. It aims to challenge and assess the generalization ability of models to adapt and perform in varying acoustic environments, reflecting the real-world variability encountered in marine mammal monitoring.

Figure 1: Overview of the sound event detection task. Blue whale (Bm) D and Z calls are present, as well as fin whale (Bp) 20 Hz pulses with (Bp20p) or without (Bp20) overtone and the 40 Hz downsweep (BpD).


Call description

The 7 call types to detect are as follows:

  • Antarctic blue whale Z-call (label: bmz): single unit with a smooth transition from 27 to 16 Hz, composed of three parts A, B and C
  • Antarctic blue whale A-call (label: bma): Z-call containing only the A part
  • Antarctic blue whale B-call (label: bmb): Z-call containing only the B part
  • Antarctic blue whale D-call (label: bmd): downsweeping frequency component between 20 and 120 Hz, which can also include further frequency modulation (e.g., an upsweep at the start)
  • Fin whale 20 Hz pulse without overtone (label: bp20): downsweeping frequency component from 30 to 15 Hz
  • Fin whale 20 Hz pulse with overtone (label: bp20plus): 20 Hz pulse with secondary energy at varying frequencies (80 to 120 Hz)
  • Fin whale 40 Hz downsweep (label: bpd): downsweeping vocalisation ending around 40 Hz, usually below 90 Hz and above 30 Hz

See more detailed information and examples of call spectrograms here.
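
Since all of these calls sit in narrow low-frequency bands, it can be handy to keep their approximate frequency ranges in code, for instance to set spectrogram display limits or band-pass filters. The values below are simply transcribed from the descriptions above (the bma/bmb entries reuse the full Z-call band); they are indicative, not official band definitions.

```python
# Approximate frequency ranges (Hz) per call label, transcribed from the
# call descriptions above. Indicative values only, not official definitions.
CALL_FREQ_RANGES_HZ = {
    "bmz": (16, 27),        # full three-part Z-call
    "bma": (16, 27),        # A part of the Z-call (upper portion of the band)
    "bmb": (16, 27),        # B part of the Z-call
    "bmd": (20, 120),       # blue whale D-call downsweep
    "bp20": (15, 30),       # fin whale 20 Hz pulse without overtone
    "bp20plus": (15, 120),  # 20 Hz pulse plus overtone energy up to ~120 Hz
    "bpd": (30, 90),        # fin whale 40 Hz downsweep
}
```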

Model development

Train and validation datasets

The overall development dataset is composed of the entire IWC-SORP ATBFL dataset, already introduced in the scientific context. As described in Table 1, it is organized into 11 site-year deployments located all around Antarctica, with recording periods ranging from 2005 to 2017. It contains a total of 6591 audio files, totaling 1880 hours of recordings, sampled at 250 Hz.

The training dataset is composed of all site-year deployments except Kerguelen 2014, Kerguelen 2015 and Casey 2017, which have been left out of training to form the validation dataset. This gives a total of 6004 audio files for the training dataset over 8 site-year deployments, and 587 audio files for the validation dataset over 3 site-year deployments.

| Deployment | Number of audio recordings | Total duration (h) | Total events | Ratio event/duration (%) |
|---|---|---|---|---|
| ballenyisland2015 | 205 | 204 | 2222 | 1.4 |
| casey2014 | 194 | 194 | 6866 | 7.3 |
| elephantisland2013 | 2247 | 187 | 21223 | 8.6 |
| elephantisland2014 | 2595 | 216 | 20964 | 13 |
| greenwich2015 | 190 | 31.7 | 1128 | 6.5 |
| kerguelen2005 | 200 | 200 | 2960 | 1.8 |
| maudrise2014 | 200 | 83.3 | 2360 | 6.9 |
| rosssea2014 | 176 | 176 | 104 | 5 |
| TOTAL TRAIN | 6007 | 1292 | 57827 | 5.1 |
| casey2017 | 187 | 185 | 3263 | 3.3 |
| kerguelen2014 | 200 | 200 | 8822 | 5.7 |
| kerguelen2015 | 200 | 200 | 5542 | 3.7 |
| TOTAL VALIDATION | 587 | 585 | 17627 | 5.1 |

Table 1. Summary statistics on the development dataset.

A more complete version of this table is available here, with more statistics within the different classes and more information on the recording deployments.
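
Since the train/validation split is defined at the deployment level, it can be reproduced by hard-coding the three validation deployments listed in Table 1 and routing files accordingly. A minimal sketch, assuming the audio is stored in one sub-folder per deployment (check the Zenodo entry for the actual folder layout):

```python
from pathlib import Path

# Deployments held out for validation, as listed in Table 1.
VALIDATION_DEPLOYMENTS = {"casey2017", "kerguelen2014", "kerguelen2015"}

def split_by_deployment(root: str):
    """Return (train_files, val_files), assuming <root>/<deployment>/<file>.wav."""
    train, val = [], []
    for wav in sorted(Path(root).rglob("*.wav")):
        deployment = wav.parent.name.lower()
        (val if deployment in VALIDATION_DEPLOYMENTS else train).append(wav)
    return train, val
```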

Annotation

Description

The annotation data of the development dataset corresponds to the annotations published with the IWC-SORP ATBFL dataset, where each site-year deployment comes with its own annotation file. The 11 CSV annotation files are named after their corresponding site-year deployments. Each annotated sound event is defined by the tuple (label, low_frequency, high_frequency, start_datetime, end_datetime), with label taking a unique value in {bma, bmb, bmz, bmd, bpd, bp20, bp20plus}.
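
For illustration, one of these annotation files can be loaded with pandas roughly as follows; the column names follow the tuple above, so adjust them if the released files use slightly different headers:

```python
import pandas as pd

LABELS = {"bma", "bmb", "bmz", "bmd", "bpd", "bp20", "bp20plus"}

def load_annotations(csv_path: str) -> pd.DataFrame:
    """Load one site-year annotation file and add an event duration column."""
    df = pd.read_csv(csv_path, parse_dates=["start_datetime", "end_datetime"])
    df = df[df["label"].isin(LABELS)].copy()   # guard against unexpected labels
    df["duration_s"] = (df["end_datetime"] - df["start_datetime"]).dt.total_seconds()
    return df
```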

Protocol and feedback

The development dataset was annotated by a group of expert bioacousticians, with one expert per site-year deployment, following the SORP annotation guidelines. Despite these precautions, and as with most annotated bioacoustics datasets, the annotation sets may still contain defects that any model developer using this corpus should be aware of. In particular, even with a common protocol, ensuring sufficient consistency between the different annotators, and thus between the different site-year annotation sets, remains particularly difficult for the following reasons, which would have required more standardized procedures:

  • different annotation styles and practices: experts have different thresholds for when they mark a call, as well as different styles for marking start and end times and low and high frequencies. Some analysts are precise and some are fast, which affects the overall accuracy of the bounds; few are both;
  • fragmentation, multipath and splitting vs. lumping: long tonal calls can be fragmented by propagation and multipath. Some analysts are splitters and annotate every fragment independently, while others are lumpers and mark a single long call covering all fragments and multipaths;
  • multipath: expert analysts are often unsure whether a potential multipath arrival is an echo or a separate calling animal. The broader context and sequence of calls is likely to be helpful here.

All of these issues are of course exacerbated during time periods with low SNR. For example, the Elephant Island 2015 site-year deployment has already been recognized as more difficult to process for this reason.

Having said that, producing high-quality annotations for such a large-scale dataset is widely recognized as a complex and cumbersome process, both in terms of human resources and scientific expertise. As acknowledged in related audio processing fields [6], these potential defects in the annotations of the development set should be seen as an intrinsic component of this data challenge, reflecting real-life annotation practice, and should be fully addressed by the models.

Download

Raw audio recordings of the development dataset, along with the annotation data, can be downloaded from this Zenodo entry. Note that minor changes were made to the original ATBFL dataset, such as adopting more consistent naming conventions and pooling all per-call-type annotation files into a single file per dataset (see the complete list of minor changes on Zenodo).

Supplementary resources

  • All datasets, annotations and pre-trained models from this list are allowed;
  • The use of other external data (e.g. audio files, annotations) and pre-trained models is allowed only after approval from the task coordinators (contact: dorian.cazau@ensta.fr). Such external data and models must be public, open datasets.

Model evaluation

Dataset

The evaluation dataset is composed of two new site-year deployments, not yet published as part of the ATBFL dataset. They contain the same cetacean species as the development dataset, but from different sites in Antarctica and/or different time periods. These deployments will be used as independent evaluation sets to provide more detailed insights into the generalization performance of models, and an overall evaluation score will also be computed to produce the global ranking of models.

Annotation

Within the IWC-SORP ATBFL project, the same annotation setup as for the development dataset was used for the evaluation dataset. In addition, to ensure the highest quality of evaluation annotations, a two-phase multi-annotator campaign was specially designed for this challenge, including a complete re-annotation of all evaluation data plus a double-checking procedure for the most conflicting cases. The complete protocol and associated results will be released at the end of the data challenge.

Metrics

The evaluation metric is a 1D IoU (Intersection over Union): the time during which a predicted event overlaps a ground-truth event, divided by the total time spanning from the minimum start time to the maximum end time of the two. To emphasize the importance of estimating an accurate number of calls (for example, for a downstream task of population density estimation), this metric has been customized to penalize model outputs in which several detections overlap a single ground truth. For example, if 3 predicted sound events overlap one ground-truth event, only one of them is marked as a true positive (TP) and assigned as correct, while the others are marked as false positives (FP). TPs are then computed by counting all prediction events that have been marked as correct, FPs are all prediction events that were not assigned to a ground truth, and FNs are all ground-truth events that were not assigned any prediction. Recall, precision and F1-score are then computed per class and per deployment.
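
To make the matching rule concrete, below is a minimal sketch of the 1D IoU and of a greedy per-class assignment in which each ground-truth event can absorb at most one prediction, so that extra overlapping detections count as false positives. The IoU threshold and the matching order are assumptions; the reference implementation is the one linked on Github below.

```python
def iou_1d(pred, gt):
    """1D IoU between two (start, end) intervals, e.g. expressed in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(preds, gts, iou_threshold=0.1):
    """Greedy matching for one class and one deployment: each ground truth
    accepts at most one prediction; surplus detections become false positives."""
    matched, tp = set(), 0
    for p in preds:
        best_iou, best_idx = 0.0, None
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = iou_1d(p, g)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(preds) - tp           # predictions not assigned to any ground truth
    fn = len(gts) - len(matched)   # ground truths left without a prediction
    return tp, fp, fn

# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * Precision * Recall / (Precision + Recall), per class and per deployment.
```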

See more details on Github.

Download

This dataset will be released on 1 June 2025.

Rules and submission

Participants are free to employ any preprocessing technique and network architecture. The only requirement is that the final output of your model MUST be a CSV file following the annotation format of the evaluation set described above.

Official challenge submission consists of:

  • System output file (*.csv)
  • Metadata file (*.yaml)
  • Technical report explaining in sufficient detail the method (*.pdf)

System output should be presented as a single text file, in CSV format, with a header row, as shown in the example output below:

dataset,filename,annotation,start_datetime,end_datetime
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:32:03.876700+00:00,2014-02-18T21:32:13.281600+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bma,2014-02-18T21:37:42.187800+00:00,2014-02-18T21:37:51.400800+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmb,2014-02-18T21:39:06.640300+00:00,2014-02-18T21:39:15.277500+00:00
kerguelen2014,2014-02-18T21-00-00_000.wav,bmz,2014-02-18T21:48:19.270900+00:00,2014-02-18T21:48:28.292000+00:00
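
For reference, such a file can be produced with pandas along the lines below; this is only a sketch, the important points being the exact column order and timezone-aware ISO 8601 timestamps as in the example above.

```python
import pandas as pd

COLUMNS = ["dataset", "filename", "annotation", "start_datetime", "end_datetime"]

def write_submission(detections, out_path="submission.csv"):
    """detections: iterable of dicts with the keys listed in COLUMNS,
    where the datetimes are timezone-aware pandas Timestamps."""
    df = pd.DataFrame(list(detections), columns=COLUMNS)
    for col in ("start_datetime", "end_datetime"):
        # isoformat() keeps the "+00:00" UTC offset shown in the example above
        df[col] = df[col].map(lambda t: pd.Timestamp(t).isoformat())
    df.to_csv(out_path, index=False)
```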

Example meta information file baseline system Parcerisas_VLIZ_task2_1.meta.yaml:

# Submission information
submission:
  # Submission label
  # Label is used to index submissions, to avoid overlapping codes among submissions
  # use the following way to form your label:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: Parcerisas_VLIZ_task2_1

  # Submission name
  # This name will be used in the results tables when space permits
  name: yolo baseline

  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight, maximum 10 characters
  abbreviation: yolo_base

  # Submission authors in order, mark one of the authors as corresponding author.
  authors:
    # First author
    - lastname: Parcerisas
      firstname: Clea
      email: clea.parcerisas@vliz.be                    # Contact email address
      corresponding: true                             # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: VLIZ
        institute: Flanders Marine Institute
        department: Marine Observation Centre
        location: Ostend, BE

        #... More authors can be specified by repeating the information


# System information
system:
  # SED system description, meta data provided here will be used to do
  # meta analysis of the submitted system. Use general level tags, if possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input
    input_sampling_rate: 250               # In Hz (specify "any" if you take it into account)

    # Acoustic representation
    acoustic_features: spectrogram   # e.g one or multiple [MFCC, log-mel energies, spectrogram, CQT, PCEN, ...]
    
    # Noise percentage
    noise_percentage: balanced   # specify "balanced" when taking the same amount of label samples. Otherwise specify percentage of the total noise as float from 0 to 1

    # Data augmentation methods
    data_augmentation: !!null             # [time stretching, block mixing, pitch shifting, ...] Specify !!null if none

    # Embeddings
    # e.g. VGGish, OpenL3, ...
    embeddings: !!null

    # Machine learning
    # In case using ensemble methods, please specify all methods used (comma separated list).
    machine_learning_method: YOLO         # e.g one or multiple [GMM, HMM, SVM, kNN, MLP, CNN, RNN, CRNN, NMF, random forest, ensemble, transformer, ...]

    # External data usage method
    # e.g. directly, embeddings, pre-trained model, ...
    external_data_usage: !!null

    # Ensemble method subsystem count
    # In case ensemble method is not used, mark !!null.
    ensemble_method_subsystem_count: !!null # [2, 3, 4, 5, ... ]

    # Decision making methods (for ensemble)
    decision_making: !!null                 # [majority vote, ...]

    # Post-processing, followed by the time span (in ms) in case of smoothing
    post-processing: YOLO default				# [median filtering, time aggregation...]

  # System complexity, meta data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:

    # Total amount of parameters used in the acoustic model. For neural networks, this
    # information is usually given before training process in the network summary.
    # For other than neural networks, if parameter count information is not directly available,
    # try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    total_parameters: 9,428,953    # note that for simple template matching, the "parameters"==the pixel count of the templates, plus 1 for each param such as thresholding. 
    # Approximate training time followed by the hardware used
    training_time: 10h 40min
    # Model size in MB
    model_size: 22


  # URL to the source code of the system [optional, highly recommended]
  source_code: https://github.com/marinebioCASE/task2_2025/tree/main/baselines/yolo 

  # List of external datasets used in the submission.
  # A previous DCASE development dataset is used here only as example! List only external datasets
  external_datasets:
    # Dataset name
    - name: !!null
      # Dataset access url
      url: !!null
      # Total audio length in minutes
      total_audio_length: !!null            # minutes

# System results 
results:
  # Full results are not mandatory, but they are recommended for a thorough analysis of the challenge submissions.
  # If you cannot provide all result details, also incomplete results can be reported.
  validation_set:
    overall:
      F1: 0.43 # 0 to 1
      precision: 0.67
      recall: 0.32 

    # Per-dataset
    dataset_wise:
      Casey2017: 
        F1: 0.55 # 0 to 1
        precision: 0.7
        recall: 0.46
      Kerguelen2014:
        F1: 0.38 # 0 to 1
        precision: 0.66
        recall: 0.29
      Kerguelen2015:
        F1: 0.40 # 0 to 1
        precision: 0.67
        recall: 0.32


You can directly download those files here.

Baseline

An off-the-shelf object detection model, YOLOv11, was run to obtain baseline performance on the validation set (see Table 2). In addition to providing reference performance for the task, this baseline also serves as a getting-started code example, where you will find general routines to load audio and annotation files in different Python frameworks, compute spectrograms, train baseline models and run the evaluation code on the validation set.

| Model | Recall | Precision | F1 |
|---|---|---|---|
| YOLOv11 | 0.32 | 0.67 | 0.43 |

Table 2. Baseline performance on the validation set.

See more detailed results per class and per dataset here.

See baseline details and download the code on Github.
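
For a quick look at the data before diving into the baseline repository, a spectrogram of one 250 Hz recording can be computed with standard Python tools; this is a minimal sketch with arbitrary window settings, not the baseline's own preprocessing.

```python
import numpy as np
import soundfile as sf
from scipy.signal import spectrogram

def compute_log_spectrogram(wav_path, nperseg=256, noverlap=192):
    """Load one recording (expected sample rate: 250 Hz) and return
    frequencies (Hz), times (s) and the log-magnitude spectrogram (dB)."""
    audio, sr = sf.read(wav_path)
    freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return freqs, times, 10 * np.log10(sxx + 1e-12)  # small offset avoids log(0)
```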

Support

If you have questions please use the BioDCASE Google Groups community forum, or directly contact task coordinators (dorian.cazau@ensta.fr).

References

[1] Miller et al. (2021). An open access dataset for developing automated detectors of Antarctic baleen whale sounds and performance evaluation of two commonly used detectors. Sci. Rep., 11, 806. doi:10.1038/s41598-020-78995-8
[2] Castro et al. (2024). Beyond counting calls: estimating detection probability for Antarctic blue whales reveals biological trends in seasonal calling. Front. Mar. Sci. doi:10.3389/fmars.2024.1406678
[3] Miller et al. (2020). An annotated library of underwater acoustic recordings for testing and training automated algorithms for detecting Antarctic blue and fin whale sounds. doi:10.26179/5e6056035c01b
[4] Schall et al. (2024). Deep learning in marine bioacoustics: a benchmark for baleen whale detection. Remote Sens. Ecol. Conserv., 10: 642-654. doi:10.1002/rse2.392
[5] Dubus et al. (2024). Improving automatic detection with supervised contrastive learning: application with low-frequency vocalizations. Workshop DCLDE 2024.
[6] Fonseca et al. (2022). FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM TASLP, vol. 30 (1).