More details can be found on the task page.
Teams ranking
| Name | Technical Report | Official rank | Rank score | DDU2021 F-Score | Kerguelen2020 F-Score | Overall F-Score | Overall Recall | Overall Precision |
|---|---|---|---|---|---|---|---|---|
| tiny yolov4 coco | Ferguson2025 | 7 | 27.50 | 27.50 | 28.90 | 27.50 | 17.50 | 63.70 |
| edge optimised CNN | vanToor2025 | 13 | 4.40 | 3.50 | 10.60 | 4.40 | 2.40 | 31.90 |
| edge optimised CNN | vanToor2025 | 12 | 8.50 | 7.60 | 14.00 | 8.50 | 4.80 | 35.50 |
| Swin transformer fusion with softmax | Marolt2025 | 1 | 50.00 | 46.40 | 50.40 | 50.00 | 47.70 | 52.50 |
| Swin transformer fusion with softmax | Marolt2025 | 2 | 43.80 | 38.80 | 53.40 | 43.80 | 33.80 | 62.40 |
| Enhanced Spectrogram-YOLO | Nihal2025 | 11 | 8.60 | 7.20 | 16.00 | 8.60 | 4.60 | 70.80 |
| Whale-VAD | Geldenhuys2025 | 5 | 34.90 | 34.20 | 32.10 | 34.90 | 41.40 | 30.20 |
| Whale-VAD | Geldenhuys2025 | 6 | 33.60 | 33.90 | 31.90 | 33.60 | 45.80 | 26.60 |
| Whale-VAD | | 4 | 35.70 | 33.90 | 36.30 | 35.70 | 38.20 | 33.60 |
| voxaboxen dv | Hausler2025 | 8 | 19.20 | 20.50 | 11.20 | 19.20 | 18.10 | 20.60 |
| voxaboxen dv | Hausler2025 | 10 | 15.90 | 16.40 | 13.20 | 15.90 | 10.80 | 29.90 |
| voxaboxen dv | Hausler2025 | 9 | 17.80 | 18.50 | 14.10 | 17.80 | 13.90 | 24.90 |
| Baseline YOLOv11 | | 3 | 39.20 | 39.50 | 42.70 | 39.20 | 33.10 | 48.00 |
Technical reports
BioDCASE Baleen Whale Deep Learning Detection and Classification Network
Alongi, Gabriela and Ferguson, Liz and Sugarman, Peter
Ocean Science Analytics
Ferguson_task2
WHALE-VAD: whale vocalization activity detection
Christiaan M. Geldenhuys
University of Stellenbosch
Geldenhuys_SUN_task2
Abstract
In this work, we present a sound event detection (SED) system focused on whale call detection. We propose a hybrid CNN-BiLSTM architecture adapted from the voice activity detection (VAD) field to perform coherent per-frame whale call activity detection. In addition, we investigate the multi-objective regression task of bounding box estimation in conjunction with activity detection. We compare the performance of our system to a baseline mel spectrogram BiLSTM and a fine-tuned HuBERT system. As part of the 2025 BioDCASE challenge (Task 2), we also compare our system to ResNet-18 and YOLOv11 models. Each model has been trained on a subset of the publicly available ATBFL dataset. Our model was able to outperform all models, including the top-performing YOLOv11 model, on development results. The final model, trained using the collapsed labels along with phase information, achieved an F1-score of 0.44 across all development sets.
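A per-frame activity detector such as the CNN-BiLSTM described above still has to group its frame-level output into discrete call events at some point. The sketch below illustrates that post-processing step only; the threshold and frame hop are illustrative assumptions, not values from the report.

```python
def frames_to_events(probs, threshold=0.5, hop_s=0.02):
    """Group per-frame call-activity probabilities into (start, end) events.

    probs: per-frame activity probabilities from a frame-level detector.
    hop_s: assumed frame hop in seconds (illustrative, not from the report).
    """
    events = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i                      # event onset
        elif not active and start is not None:
            events.append((start * hop_s, i * hop_s))  # event offset
            start = None
    if start is not None:                  # event still open at the end
        events.append((start * hop_s, len(probs) * hop_s))
    return events

# Example: two active regions in a short probability track (1 s frames)
print(frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7, 0.6], hop_s=1.0))
# -> [(1.0, 3.0), (4.0, 6.0)]
```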
System characteristics
Classifier: WHALE-VAD
Deep Voice Below the Surface: Improved Whale Call Detection via Voxaboxen Refinement
Danielle Hausler
Deep Voice Foundation
Hausler_task2
Abstract
We present a solution for Task 2 of the BioDCASE 2025 Challenge, "Supervised detection of strongly-labelled whale calls", using the Voxaboxen framework. Originally developed for bioacoustic vocalization annotation, Voxaboxen predicts the temporal start and end points of calls and regresses their durations. By graph-matching forward predictions (start + duration) with backward predictions (end + duration), tight bounding boxes in the time domain are generated for each detected call. Unlike YOLO-style detectors that propose many candidate boxes and select the best, our method builds each box directly, improving temporal precision and reducing duplicate proposals. We adapt this framework to detect Antarctic blue and fin whale calls in the BioDCASE Task 2 dataset, and evaluate performance on the official evaluation set, demonstrating enhanced overlap accuracy and call localization.
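The forward/backward pairing idea can be illustrated with a toy matcher. This is a simplified greedy sketch under assumed tolerances, not the bipartite graph matching Voxaboxen actually performs: each forward prediction implies a box from (start, duration), each backward prediction implies one from (end, duration), and agreeing pairs are merged into a single tight box.

```python
def pair_predictions(forward, backward, tol=0.5):
    """Greedily pair forward (start, duration) with backward (end, duration)
    predictions whose implied boxes agree within `tol` seconds.

    Simplified stand-in for the graph matching described in the abstract;
    the tolerance value is an illustrative assumption.
    """
    boxes = []
    used = set()
    for start, dur_f in forward:
        best, best_cost = None, tol
        for j, (end, dur_b) in enumerate(backward):
            if j in used:
                continue
            # disagreement between the two implied boxes (end and start sides)
            cost = abs((start + dur_f) - end) + abs(start - (end - dur_b))
            if cost <= best_cost:
                best, best_cost = j, cost
        if best is not None:
            used.add(best)
            boxes.append((start, backward[best][0]))  # tight matched box
    return boxes

fwd = [(1.0, 2.0), (5.0, 1.0)]               # (start, duration)
bwd = [(3.1, 2.1), (6.0, 1.0), (9.0, 0.5)]   # (end, duration)
print(pair_predictions(fwd, bwd))
# -> [(1.0, 3.1), (5.0, 6.0)]; the third backward prediction finds no partner
```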
Multi resolution feature fusion for supervised detection of strongly-labelled whale calls
Marolt, Matija and Bones, Eva
University of Ljubljana, Faculty of Computer and Information Science
Marolt_task2
Abstract
We present the outline of our deep architecture for supervised detection of strongly-labelled whale calls. Our architecture is based on three parallel Swin transformers [1] that process the input audio on multiple time scales. The input audio is windowed into approximately 16-second-long normalized chunks, which are processed by LEAF [2], a learnable frontend for audio classification, to obtain a time-frequency representation. Three different representations are calculated, using the same number of frequency channels but different window sizes. Each is processed by a different three-layer Swin transformer (using 4x4 patch sizes with 2x2 stride to obtain initial input patch tokens, and with patch merging between layers [3]). Output feature maps with different time resolutions are upscaled to the same time resolution and fused with a downscaling convolution in feature space. The final feature map is upscaled in time to the original time resolution, followed by the fully-connected classification layer. Trained on the provided training set and evaluated on the validation set, the architecture yields a micro-F1 score of 0.58 and a macro-F1 score of 0.64. To obtain the output for the task, the input context window is shifted by 0.5 seconds along the time axis and the outputs of the model are averaged for the same time instances. Two versions of the model's output are submitted: one using the softmax activation in the output layer (multi-class), and another using the sigmoid activation (multi-label); the latter should result in greater recall with lower precision than the former.
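The overlap-and-average step at the end (shifting the context window by 0.5 s and averaging all predictions that cover the same time instant) can be sketched as follows. Window and hop lengths in frames are illustrative placeholders, not the model's actual dimensions.

```python
import numpy as np

def average_overlapping(window_outputs, hop):
    """Average per-frame outputs from overlapping context windows.

    window_outputs: list of arrays of shape (win_len, n_classes); window k
    is assumed to start at frame k * hop.
    Returns (total_frames, n_classes): each frame averaged over every
    window that covered it.
    """
    win_len, n_classes = window_outputs[0].shape
    total = (len(window_outputs) - 1) * hop + win_len
    acc = np.zeros((total, n_classes))
    counts = np.zeros((total, 1))
    for k, out in enumerate(window_outputs):
        acc[k * hop : k * hop + win_len] += out
        counts[k * hop : k * hop + win_len] += 1
    return acc / counts

# Two 4-frame windows shifted by 2 frames, single-class problem:
w0 = np.array([[0.2], [0.4], [0.6], [0.8]])
w1 = np.array([[0.8], [1.0], [0.4], [0.2]])
avg = average_overlapping([w0, w1], hop=2)
# frames 2-3 are covered by both windows, so their outputs are averaged
```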
Enhanced spectrogram processing with temporal sequences for antarctic whale call detection using YOLOv11
Ragib Amin Nihal
Institute of Science Tokyo
Nihal_task2_1
Abstract
We present a modified spectrogram processing approach combined with temporal sequence analysis for Antarctic whale call detection in the BioDCASE 2025 Challenge Task 2. Building on the baseline YOLO approach, we implement enhanced spectrogram preprocessing through pre-filtering, magnitude inversion, and 98th-percentile normalization. We create temporal awareness by generating 3-frame RGB sequences from consecutive spectrogram frames, allowing a YOLOv11m detector to process temporal information. Class-specific confidence thresholds are applied based on validation performance analysis. On the validation set, the approach achieves a 59.45% F1-score, 72.88% precision, and 51.93% recall, representing an improvement over the baseline YOLO performance of 43% F1-score.
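The preprocessing chain (magnitude inversion, 98th-percentile normalization, 3-frame RGB stacking) can be sketched roughly as below. The exact filtering and scaling details are assumptions for illustration, not taken from the report.

```python
import numpy as np

def preprocess(spec):
    """Invert and normalize a magnitude spectrogram into [0, 1].

    Inversion makes low-energy regions bright; clipping at the 98th
    percentile keeps a few loud transients from compressing the rest of
    the dynamic range. An illustrative guess at the paper's recipe.
    """
    inv = spec.max() - spec                 # magnitude inversion
    ceiling = np.percentile(inv, 98)        # robust upper bound
    return np.clip(inv / ceiling, 0.0, 1.0)

def rgb_sequence(frames):
    """Stack three consecutive spectrogram frames into one RGB image,
    giving the YOLO detector a small temporal context."""
    assert len(frames) == 3
    return np.stack([preprocess(f) for f in frames], axis=-1)

rng = np.random.default_rng(0)
frames = [rng.random((64, 128)) for _ in range(3)]
img = rgb_sequence(frames)  # shape (64, 128, 3), values in [0, 1]
```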
BioDCASE 2025 challenge combo: supervised whale calls on tiny hardware
Astrid Van Toor
blueOasis
vanToor_BO_task2
Abstract
This technical report presents an edge-optimised approach to baleen whale call detection for Task 2 of the BioDCASE 2025 challenge. Taking inspiration from Task 3, it focuses on the deployment constraints of resource-limited hardware. Where common models start at around 4 million training parameters [1], with architectures often unsuitable for real-time edge deployment, our model contains just 35,571 training parameters (159 KB) and operates efficiently on a 64-bit ARM Cortex-A53 with 512 MB RAM. On a detection-window basis of 11.8-second frames, the model performs well on two of the three classes; applying a precision-focused approach, we detect blue whale ABZ calls at 72% precision and fin whale burst pulse calls at 80% precision, while downsweep predictions lag behind at 18% precision. Applying our temporal head designed for compression into TFLite, we maintain reasonable precision for ABZ calls at 65%, while downsweep calls rise to 29% precision and burst pulse calls drop significantly to 4%. Acknowledging the difficulties in call-specific identification, this report highlights the feasibility and potential of edge-optimised architectures for baleen whale detection in real-world monitoring scenarios where computational resources and power consumption are severely constrained, while addressing common challenges and next steps to improve the results.
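Slicing a recording into fixed 11.8 s detection windows, as described above, is a simple framing step. The sample rate here is an illustrative assumption (low-frequency baleen-whale recordings are often heavily decimated); the report does not specify it.

```python
def window_audio(samples, sr=250, win_s=11.8):
    """Split an audio signal into fixed-length detection windows.

    sr: assumed sample rate in Hz (illustrative, not from the report).
    Trailing samples that do not fill a full window are dropped.
    """
    win = int(sr * win_s)
    return [samples[i : i + win] for i in range(0, len(samples) - win + 1, win)]

# 60 s of dummy audio at 250 Hz yields five full 11.8 s windows
audio = [0.0] * (60 * 250)
wins = window_audio(audio)
print(len(wins), len(wins[0]))  # -> 5 2950
```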
System characteristics
Classifier: TinyHardware