More details can be found on the task page.
Teams ranking
| Name | Technical Report | Official rank | Rank score | DDU2021 F-Score | Kerguelen2020 F-Score | Overall F-Score | Overall Recall | Overall Precision |
|---|---|---|---|---|---|---|---|---|
| tiny yolov4 coco | Ferguson2025 | 7 | 27.50 | 27.50 | 28.90 | 27.50 | 17.50 | 63.70 |
| edge optimised CNN | vanToor2025 | 13 | 4.40 | 3.50 | 10.60 | 4.40 | 2.40 | 31.90 |
| edge optimised CNN | vanToor2025 | 12 | 8.50 | 7.60 | 14.00 | 8.50 | 4.80 | 35.50 |
| Swin transformer fusion with softmax | Marolt2025 | 1 | 50.00 | 46.40 | 50.40 | 50.00 | 47.70 | 52.50 |
| Swin transformer fusion with softmax | Marolt2025 | 2 | 43.80 | 38.80 | 53.40 | 43.80 | 33.80 | 62.40 |
| Enhanced Spectrogram-YOLO | Nihal2025 | 11 | 8.60 | 7.20 | 16.00 | 8.60 | 4.60 | 70.80 |
| Whale-VAD | Geldenhuys2025 | 5 | 34.90 | 34.20 | 32.10 | 34.90 | 41.40 | 30.20 |
| Whale-VAD | Geldenhuys2025 | 6 | 33.60 | 33.90 | 31.90 | 33.60 | 45.80 | 26.60 |
| Whale-VAD | | 4 | 35.70 | 33.90 | 36.30 | 35.70 | 38.20 | 33.60 |
| voxaboxen dv | Hausler2025 | 8 | 19.20 | 20.50 | 11.20 | 19.20 | 18.10 | 20.60 |
| voxaboxen dv | Hausler2025 | 10 | 15.90 | 16.40 | 13.20 | 15.90 | 10.80 | 29.90 |
| voxaboxen dv | Hausler2025 | 9 | 17.80 | 18.50 | 14.10 | 17.80 | 13.90 | 24.90 |
| Baseline YOLOv11 | | 3 | 39.20 | 39.50 | 42.70 | 39.20 | 33.10 | 48.00 |
Technical reports
BioDCASE Baleen Whale Deep Learning Detection and Classification Network
Alongi, Gabriela and Ferguson, Liz and Sugarman, Peter
Ocean Science Analytics
Ferguson_task2
WHALE-VAD: whale vocalization activity detection
Christiaan M. Geldenhuys
University of Stellenbosch
Geldenhuys_SUN_task2
Abstract
In this work, we present a sound event detection (SED) system focused on whale call detection. We propose a hybrid CNN-BiLSTM architecture adapted from the voice activity detection (VAD) field to perform coherent per-frame whale call activity detection. In addition, we investigate the multi-objective regression task of bounding box estimation in conjunction with activity detection. We compare the performance of our system to a baseline mel spectrogram BiLSTM and a fine-tuned HuBERT system. As part of the 2025 BioDCASE challenge (Task 2), we also compare our system to ResNet-18 and YOLOv11 models. Each model has been trained on a subset of the publicly available ATBFL dataset. Our model was able to outperform all models, including the top-performing YOLOv11 model, on development results. The final model, trained using the collapsed labels along with phase information, achieved an F1-score of 0.44 across all development sets.
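A per-frame activity detector such as the CNN-BiLSTM described above still has to group its frame-level output into discrete call events at some point. The sketch below illustrates that post-processing step only; the threshold and frame hop are illustrative assumptions, not values from the report.

```python
def frames_to_events(probs, threshold=0.5, hop_s=0.02):
    """Group per-frame call-activity probabilities into (start, end) events.

    probs: per-frame activity probabilities from a frame-level detector.
    hop_s: assumed frame hop in seconds (illustrative, not from the report).
    """
    events = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i                      # event onset
        elif not active and start is not None:
            events.append((start * hop_s, i * hop_s))  # event offset
            start = None
    if start is not None:                  # event still open at the end
        events.append((start * hop_s, len(probs) * hop_s))
    return events

# Example: two active regions in a short probability track (1 s frames)
print(frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7, 0.6], hop_s=1.0))
# -> [(1.0, 3.0), (4.0, 6.0)]
```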
System characteristics
Classifier: WHALE-VAD
Deep Voice Below the Surface: Improved Whale Call Detection via Voxaboxen Refinement
Danielle Hausler
Deep Voice Foundation
Hausler_task2
Abstract
We present a solution for Task 2 of the BioDCASE 2025 Challenge, "Supervised detection of strongly-labelled whale calls", using the Voxaboxen framework. Originally developed for bioacoustic vocalization annotation, Voxaboxen predicts the temporal start and end points of calls and regresses their durations. By graph-matching forward predictions (start + duration) with backward predictions (end + duration), tight bounding boxes in the time domain are generated for each detected call. Unlike YOLO-style detectors that propose many candidate boxes and select the best, our method builds each box directly, improving temporal precision and reducing duplicate proposals. We adapt this framework to detect Antarctic blue and fin whale calls in the BioDCASE Task 2 dataset, and evaluate performance on the official evaluation set, demonstrating enhanced overlap accuracy and call localization.
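The forward/backward pairing idea can be illustrated with a toy matcher. This is a simplified greedy sketch under assumed tolerances, not the bipartite graph matching Voxaboxen actually performs: each forward prediction implies a box from (start, duration), each backward prediction implies one from (end, duration), and agreeing pairs are merged into a single tight box.

```python
def pair_predictions(forward, backward, tol=0.5):
    """Greedily pair forward (start, duration) with backward (end, duration)
    predictions whose implied boxes agree within `tol` seconds.

    Simplified stand-in for the graph matching described in the abstract;
    the tolerance value is an illustrative assumption.
    """
    boxes = []
    used = set()
    for start, dur_f in forward:
        best, best_cost = None, tol
        for j, (end, dur_b) in enumerate(backward):
            if j in used:
                continue
            # disagreement between the two implied boxes (end and start sides)
            cost = abs((start + dur_f) - end) + abs(start - (end - dur_b))
            if cost <= best_cost:
                best, best_cost = j, cost
        if best is not None:
            used.add(best)
            boxes.append((start, backward[best][0]))  # tight matched box
    return boxes

fwd = [(1.0, 2.0), (5.0, 1.0)]               # (start, duration)
bwd = [(3.1, 2.1), (6.0, 1.0), (9.0, 0.5)]   # (end, duration)
print(pair_predictions(fwd, bwd))
# -> [(1.0, 3.1), (5.0, 6.0)]; the third backward prediction finds no partner
```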
Multi resolution feature fusion for supervised detection of strongly-labelled whale calls
Marolt, Matija and Bones, Eva
University of Ljubljana, Faculty of Computer and Information Science
Marolt_task2
Abstract
We present the outline of our deep architecture for supervised detection of strongly-labelled whale calls. Our architecture is based on three parallel Swin transformers [1] that process the input audio on multiple time scales. The input audio is windowed into approximately 16-second-long normalized chunks, which are processed by LEAF [2], a learnable frontend for audio classification, to obtain a time-frequency representation. Three different representations are calculated, using the same number of frequency channels but different window sizes. Each is processed by a different three-layer Swin transformer (using 4x4 patch sizes with 2x2 stride to obtain initial input patch tokens, and with patch merging between layers [3]). Output feature maps with different time resolutions are upscaled to the same time resolution and fused with a downscaling convolution in feature space. The final feature map is upscaled in time to the original time resolution, followed by the fully-connected classification layer. Trained on the provided training set and evaluated on the validation set, the architecture yields a micro-F1 score of 0.58 and a macro-F1 score of 0.64. To obtain the output for the task, the input context window is shifted by 0.5 seconds along the time axis and the outputs of the model are averaged for the same time instances. Two versions of the model's output are submitted: one using the softmax activation in the output layer (multi-class), and another using the sigmoid activation (multi-label); the latter should result in greater recall with lower precision than the former.
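The overlap-and-average step at the end (shifting the context window by 0.5 s and averaging all predictions that cover the same time instant) can be sketched as follows. Window and hop lengths in frames are illustrative placeholders, not the model's actual dimensions.

```python
import numpy as np

def average_overlapping(window_outputs, hop):
    """Average per-frame outputs from overlapping context windows.

    window_outputs: list of arrays of shape (win_len, n_classes); window k
    is assumed to start at frame k * hop.
    Returns (total_frames, n_classes): each frame averaged over every
    window that covered it.
    """
    win_len, n_classes = window_outputs[0].shape
    total = (len(window_outputs) - 1) * hop + win_len
    acc = np.zeros((total, n_classes))
    counts = np.zeros((total, 1))
    for k, out in enumerate(window_outputs):
        acc[k * hop : k * hop + win_len] += out
        counts[k * hop : k * hop + win_len] += 1
    return acc / counts

# Two 4-frame windows shifted by 2 frames, single-class problem:
w0 = np.array([[0.2], [0.4], [0.6], [0.8]])
w1 = np.array([[0.8], [1.0], [0.4], [0.2]])
avg = average_overlapping([w0, w1], hop=2)
# frames 2-3 are covered by both windows, so their outputs are averaged
```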
Enhanced spectrogram processing with temporal sequences for antarctic whale call detection using YOLOv11
Ragib Amin Nihal
Institute of Science Tokyo
Nihal_task2_1
Abstract
We present a modified spectrogram processing approach combined with temporal sequence analysis for Antarctic whale call detection in the BioDCASE 2025 Challenge Task 2. Building on the baseline YOLO approach, we implement enhanced spectrogram preprocessing through pre-filtering, magnitude inversion, and 98th-percentile normalization. We create temporal awareness by generating 3-frame RGB sequences from consecutive spectrogram frames, allowing a YOLOv11m detector to process temporal information. Class-specific confidence thresholds are applied based on validation performance analysis. On the validation set, the approach achieves a 59.45% F1-score, 72.88% precision, and 51.93% recall, representing an improvement over the baseline YOLO performance of 43% F1-score.
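The preprocessing chain (magnitude inversion, 98th-percentile normalization, 3-frame RGB stacking) can be sketched roughly as below. The exact filtering and scaling details are assumptions for illustration, not taken from the report.

```python
import numpy as np

def preprocess(spec):
    """Invert and normalize a magnitude spectrogram into [0, 1].

    Inversion makes low-energy regions bright; clipping at the 98th
    percentile keeps a few loud transients from compressing the rest of
    the dynamic range. An illustrative guess at the paper's recipe.
    """
    inv = spec.max() - spec                 # magnitude inversion
    ceiling = np.percentile(inv, 98)        # robust upper bound
    return np.clip(inv / ceiling, 0.0, 1.0)

def rgb_sequence(frames):
    """Stack three consecutive spectrogram frames into one RGB image,
    giving the YOLO detector a small temporal context."""
    assert len(frames) == 3
    return np.stack([preprocess(f) for f in frames], axis=-1)

rng = np.random.default_rng(0)
frames = [rng.random((64, 128)) for _ in range(3)]
img = rgb_sequence(frames)  # shape (64, 128, 3), values in [0, 1]
```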
BioDCASE 2025 challenge combo: supervised whale calls on tiny hardware
Astrid Van Toor
blueOasis
vanToor_BO_task2
Abstract
This technical report presents an edge-optimised approach to baleen whale call detection for Task 2 of the BioDCASE 2025 challenge. Taking inspiration from Task 3, it focuses on the deployment constraints of resource-limited hardware. Where common models start at around 4 million training parameters [1], with architectures often unsuitable for real-time edge deployment, our model contains just 35,571 training parameters (159 KB) and operates efficiently on a 64-bit ARM Cortex-A53 with 512 MB RAM. On a detection-window basis of 11.8-second frames, the model performs well on two of the three classes; applying a precision-focused approach, we detect blue whale ABZ calls at 72% precision and fin whale burst pulse calls at 80% precision, while downsweep predictions lag behind at 18% precision. Applying our temporal head designed for compression into TFLite, we maintain reasonable precision for ABZ calls at 65%, while downsweep calls rise to 29% precision and burst pulse calls drop significantly to 4%. Acknowledging the difficulties in call-specific identification, this report highlights the feasibility and potential of edge-optimised architectures for baleen whale detection in real-world monitoring scenarios where computational resources and power consumption are severely constrained, while addressing common challenges and next steps to improve the results.
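Slicing a recording into fixed 11.8 s detection windows, as described above, is a simple framing step. The sample rate here is an illustrative assumption (low-frequency baleen-whale recordings are often heavily decimated); the report does not specify it.

```python
def window_audio(samples, sr=250, win_s=11.8):
    """Split an audio signal into fixed-length detection windows.

    sr: assumed sample rate in Hz (illustrative, not from the report).
    Trailing samples that do not fill a full window are dropped.
    """
    win = int(sr * win_s)
    return [samples[i : i + win] for i in range(0, len(samples) - win + 1, win)]

# 60 s of dummy audio at 250 Hz yields five full 11.8 s windows
audio = [0.0] * (60 * 250)
wins = window_audio(audio)
print(len(wins), len(wins[0]))  # -> 5 2950
```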
System characteristics
Classifier: TinyHardware