Overview
A fundamental challenge across bioacoustics domains (terrestrial and marine) is the annotation of unlabelled data. Passive acoustic monitoring systems generate vast amounts of data, but only a small portion can be feasibly annotated by expert human annotators. Since model performance depends heavily on the quality and quantity of labelled data, this raises the following research question:
Given vast amounts of raw acoustic data and limited annotation resources, which data should be prioritised for labelling?
Active learning (AL) is a critical strategy for scaling bioacoustic monitoring. Rather than annotating data at random, AL iteratively selects the most informative samples for labelling, aiming to maximise model performance within a fixed annotation budget. Participants will design a sampling (acquisition) function that selects which unlabelled samples should be labelled at each AL cycle. Submissions will be evaluated on how efficiently their method improves classification performance across two bioacoustic domains: terrestrial (BirdSet) and marine (ATBFL).
Description
You are provided with a pool of "unlabelled" embeddings (labels are hidden), corresponding to audio segments. Your task is to select which samples should be labelled at each cycle to maximise classification performance within a fixed annotation budget.
The active learning loop proceeds as follows:
- A pool of pre-generated perch_v2 embeddings and a randomly initialised classification head are provided.
- Your sampling function selects a batch of samples from the unlabelled pool.
- Labels are automatically revealed for selected samples (oracle labelling).
- The model is retrained on all labelled samples so far.
- Steps 2–4 repeat until the annotation budget is exhausted.
(Optional) You can specify a warm-up sampling function and batch size for the first AL cycle.
Your sampling function receives the current model predictions and embeddings (and optionally metadata) and must return the indices of the next batch of samples to label. You are free to modify the warm-up strategy and batch size. The classification head (model.py) and core AL loop (active_learner.py) are fixed and may not be modified.
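As an illustration, a minimal sampling function might look like the following sketch. The exact signature is defined by BaseAL's sampling.py; the argument names and shapes here are assumptions, and the random strategy simply mirrors the Random baseline:

```python
import numpy as np

def random_sampling(predictions, embeddings, n_samples, metadata=None, rng=None):
    """Illustrative sampling function: pick a random batch from the unlabelled pool.

    predictions: (n_unlabelled, n_classes) model outputs for the unlabelled pool
    embeddings:  (n_unlabelled, dim) perch_v2 embeddings
    n_samples:   batch size to select this AL cycle
    Returns indices into the unlabelled pool.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_pool = embeddings.shape[0]
    # Select without replacement so no sample is labelled twice in one cycle.
    return rng.choice(n_pool, size=min(n_samples, n_pool), replace=False)
```

Your own acquisition function would replace the random choice with a score derived from the predictions, embeddings, or metadata, returning the indices of the samples you want labelled next.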
BaseAL
BaseAL is the evaluation framework for this task. It provides the AL pipeline, baseline sampling methods, and logging infrastructure. Participants implement their method by editing sampling.py within the framework.
Setup instructions are available here, and notebooks with example usage can be found in docs/. An interactive demo of BaseAL (using ESC10 data) is available.
Dataset
A limited annotation budget is a challenge that spans both terrestrial and marine domains. To evaluate the generalisability of methods across domains, both a terrestrial dataset (BirdSet) and a marine dataset (ATBFL) are provided.
These datasets have been curated for use with BaseAL. perch_v2 embeddings have been pre-generated for 5-second audio segments; these serve as the input for training the classification head. Submissions will be evaluated on held-out sets from BOTH datasets.
The max_budget for each data subset is 500 samples, which amounts to an average of 5% of the available training samples.
Structure
The data structure is consistent across both datasets:
subset
├── embeddings
│ ├── perch_v2
│ │ └── *.npy
├── labels.csv
├── metadata.csv
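Assuming the layout above, a subset can be loaded along these lines. This is a sketch: the function name is illustrative, and any file or column names beyond those shown in the tree are assumptions.

```python
from pathlib import Path

import numpy as np
import pandas as pd

def load_subset(subset_dir):
    """Load one dataset subset following the directory layout above."""
    subset_dir = Path(subset_dir)
    emb_dir = subset_dir / "embeddings" / "perch_v2"
    # One .npy file per 5-second segment; sort for a stable ordering.
    files = sorted(emb_dir.glob("*.npy"))
    embeddings = np.stack([np.load(f) for f in files])
    labels = pd.read_csv(subset_dir / "labels.csv")
    metadata = pd.read_csv(subset_dir / "metadata.csv")
    return embeddings, labels, metadata
```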
BaseAL also ships with ESC10 demo data; check the current config.yaml and adjust it as needed for the development set you are testing on.
See information for individual datasets and subsets below:
BirdSet (AL for Bioacoustics)
BirdSet is a large-scale benchmark dataset for avian bioacoustics, containing over 6,800 hours of recordings spanning nearly 10,000 species. It includes more than 400 hours across eight strongly-labelled evaluation subsets, each corresponding to a distinct geographic location, making it well-suited for evaluating model generalisation across soundscapes.
For this task, three of the eight evaluation subsets are provided: HSN, POW and UHH. Each subset contains pre-generated perch_v2 embeddings corresponding to 5-second multi-label segments. See the table below for per-subset details.
| Subset | Training segments | Validation segments | Classes | Av. labels per sample |
|---|---|---|---|---|
| HSN | 6600 | 1800 | 19 | 0.524 |
| POW | 2280 | 684 | 41 | 2.833 |
| UHH | 18319 | 7327 | 25 | 1.058 |
The BirdSet (AL for Bioacoustics) development set is available here:
ATBFL (AL for Bioacoustics)
This dataset is derived from the Acoustic Trends Blue Fin Library (ATBFL). ATBFL is one of the largest annotated datasets in marine bioacoustics, comprising blue and fin whale recordings collected around Antarctica from 2005 to 2017.
The data used in the AL for Bioacoustics task is derived from Task 2 (2025) development set, containing multi-label annotations across 7 call types from two species. Recordings are organised into site-year subsets (e.g. kerguelen2005, greenwich2015), each representing a distinct deployment. These subsets form the basis of the per-location evaluation. Pre-generated perch_v2 embeddings for 5-second segments are provided; see the table below for per-subset details.
| Subset | Training segments | Validation segments | Classes | Av. labels per sample |
|---|---|---|---|---|
| elephantisland2014 | 2197 | 1489 | 7 | 1.965 |
| elephantisland2013 | 1630 | 1092 | 7 | 2.650 |
| casey2014 | 1282 | 782 | 7 | 2.428 |
| kerguelen2014 | 1154 | 748 | 7 | 2.684 |
| kerguelen2015 | 729 | 479 | 7 | 2.839 |
| casey2017 | 613 | 405 | 6 | 1.858 |
| maudrise2014 | 526 | 353 | 7 | 1.288 |
| kerguelen2005 | 465 | 269 | 7 | 2.176 |
| ballenyislands2015 | 309 | 205 | 7 | 1.875 |
| greenwich2015 | 171 | 113 | 6 | 1.527 |
| rosssea2014 | 10 | 8 | 1 | 0.929 |
The task 4 ATBFL (AL for Bioacoustics) development set is available here, including additional information about dataset composition:
Baselines
The classification head (model) will be randomly initialised during evaluation. The sampling method will be evaluated across 5 independent runs and the average performance across all runs used for ranking.
- Random: samples are selected randomly, simulating no active learning being applied.
- Margin (link): selects samples where the model is least confident, based on the smallest gap between the top two predicted class probabilities. For multilabel data, uncertainty is aggregated across classes using the mean (Settles 2009).
- CoreSet (link): greedily selects samples that are farthest from the current labelled set in embedding space, maximising coverage without considering model uncertainty (Sener and Savarese 2018).
- TypiClust (link): typicality-based clustering selects the most representative unlabelled sample from each of the least-covered regions of the embedding space, balancing diversity with density-based sample quality (Hacohen et al. 2022).
Baseline configuration: a fixed batch size of 50 samples, with 10 epochs per AL cycle and a learning rate of 1e-3, is used for all baselines, up to the max_budget of 500 samples.
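The core of the margin baseline described above can be sketched as follows. This is a simplified version (the standard top-2 margin); BaseAL's multilabel aggregation may differ in detail:

```python
import numpy as np

def margin_sampling(probs, n_samples):
    """Select the n_samples with the smallest top-2 probability gap.

    probs: (n_unlabelled, n_classes) predicted class probabilities.
    Smaller margins indicate higher model uncertainty.
    """
    sorted_probs = np.sort(probs, axis=1)           # ascending per sample
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margins)[:n_samples]          # most uncertain first
```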
| Method | AULC (mAP macro) | Computational cost (relative) | Computational cost (wall-time (s)) | Annotation cost |
|---|---|---|---|---|
| Random | 0.401 | 1.0 | 0.00095 | 96.75 |
| Margin | 0.399 | 1.0 | 0.00249 | 125.50 |
| CoreSet | 0.42154 | 1.0 | 6.34688 | 108.50 |
| TypiClust | 0.39080 | 1.0 | 25.28352 | 98.25 |
The following section explains how these evaluation metrics are computed →
Evaluation
Ranking metric
Area Under the Learning Curve (AULC) for the mAP (macro) across a fixed budget of N samples. This metric will be aggregated across both datasets (and subsets) for the final ranking. Hence participants must consider how their sampling method generalises across domains.
AULC = (1 / N) * Σ mAP(n) for n = 1, ..., N
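Given the per-cycle mAP values from the learning curve, the AULC reduces to a running mean. A minimal sketch (BaseAL computes this internally and reports it per cycle as aulc_mAP_mean):

```python
def aulc(map_per_cycle):
    """Area under the learning curve: mean mAP across the N evaluated cycles."""
    return sum(map_per_cycle) / len(map_per_cycle)
```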
Supplementary Metrics
Additional metrics will be provided. These will not be used for overall ranking but will inform selection for the Jury Award.
Computational cost (training): the computational cost is defined as the product of the number of model parameters (model_parameters), training epochs per cycle (epochs), and number of active learning cycles within budget (AL_cycles), reported relative to the baseline configuration. This metric is therefore the relative increase (or decrease) in training cost.
cost = model_parameters * epochs * AL_cycles
relative_cost = cost_method / baseline_cost
Computational cost (sampling-cost): Time spent in the sampling step averaged across all AL cycles. Penalises computationally expensive sampling methods.
sampling_cost = (1 / T) * Σ sampling_time(t) for t = 1, ..., T
Annotation cost: active learning generally considers only the number of samples selected for annotation; however, the cost of annotating challenging samples is also important to consider. Here the annotation cost is the number of events per selected sample (e.g. the more multilabel events, the higher the cost).
annotation_cost = Σ events(i) for i in selected_samples
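The three supplementary cost metrics above can be computed directly from quantities already reported in the results .yaml. A sketch following the formulas above; the baseline constants are taken from the example submission (2,373,128 parameters, 10 epochs per cycle, 15 cycles):

```python
def relative_cost(model_parameters, epochs, al_cycles,
                  baseline_params=2373128, baseline_epochs=10, baseline_cycles=15):
    """Training cost relative to the baseline configuration."""
    cost = model_parameters * epochs * al_cycles
    return cost / (baseline_params * baseline_epochs * baseline_cycles)

def mean_sampling_cost(sampling_times):
    """Average wall-time spent in the sampling step per AL cycle."""
    return sum(sampling_times) / len(sampling_times)

def annotation_cost(events_per_sample):
    """Total number of labelled events across the selected samples."""
    return sum(events_per_sample)
```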
Sub-domain performance: Both BirdSet and ATBFL are composed of several subsets corresponding to different locations. Performance for each of these locations will be reported.
Submission
General BioDCASE instructions are available here. This specifies submission naming conventions and the final report template. Please note some important differences:
- Participants are required to include the edited sampling.py file for reproducibility and final ranking.
- Instead of submitting predictions, participants submit a .yaml file containing metrics (e.g. AULC, computational cost, etc.).
Deliverables
Please submit a .zip file containing the following information. Please follow the general BioDCASE naming convention.
Results ({method}_{dataset}_{lastname}.yaml): containing the performance metrics of your sampling method. Note that if you export these results using the BaseAL learner.export() method, they will already be in the correct format.
Please specify 5 repeats (n_outer_repeats: 5) when generating results; this will compute the mean and SD.
submission_timestamp: '2026-03-22T15:57:31'
author_lastname: baseline
institute_abbreviation: BASEAL
sampling_strategy: margin_multilabel
dataset: ATBFL_BASEAL
model: perch_v2
config:
learning_rate: 0.001
model_parameters: 2373128
n_outer_repeats: 5
pretrain_samples: 0
learning_curve:
- cycle: 1
n_labeled: 50
mAP_mean: 0.36654
mAP_sd: 0.066797
aulc_mAP_mean: 0.18327
aulc_mAP_sd: 0.033398
annotation_cost: 144
sampling_time_s_mean: 0.003362
sampling_time_s_sd: 0.000203
- cycle: 2
n_labeled: 100
mAP_mean: 0.384691
mAP_sd: 0.05046
aulc_mAP_mean: 0.279443
aulc_mAP_sd: 0.045983
annotation_cost: 131
sampling_time_s_mean: 0.003424
sampling_time_s_sd: 0.00042
- cycle: 3
n_labeled: 150
mAP_mean: 0.411882
mAP_sd: 0.047491
aulc_mAP_mean: 0.319057
aulc_mAP_sd: 0.045024
annotation_cost: 176
sampling_time_s_mean: 0.003471
sampling_time_s_sd: 0.000303
- cycle: 4
n_labeled: 200
mAP_mean: 0.431502
mAP_sd: 0.033669
aulc_mAP_mean: 0.344716
aulc_mAP_sd: 0.042434
annotation_cost: 177
sampling_time_s_mean: 0.003403
sampling_time_s_sd: 0.000249
- cycle: 5
n_labeled: 250
mAP_mean: 0.446552
mAP_sd: 0.02969
aulc_mAP_mean: 0.363578
aulc_mAP_sd: 0.039633
annotation_cost: 146
sampling_time_s_mean: 0.00334
sampling_time_s_sd: 0.000316
- cycle: 6
n_labeled: 300
mAP_mean: 0.463594
mAP_sd: 0.01702
aulc_mAP_mean: 0.378827
aulc_mAP_sd: 0.036091
annotation_cost: 202
sampling_time_s_mean: 0.00332
sampling_time_s_sd: 0.0003
- cycle: 7
n_labeled: 350
mAP_mean: 0.465332
mAP_sd: 0.019022
aulc_mAP_mean: 0.391061
aulc_mAP_sd: 0.032631
annotation_cost: 205
sampling_time_s_mean: 0.003468
sampling_time_s_sd: 0.000429
- cycle: 8
n_labeled: 400
mAP_mean: 0.47003
mAP_sd: 0.022713
aulc_mAP_mean: 0.400638
aulc_mAP_sd: 0.030427
annotation_cost: 156
sampling_time_s_mean: 0.003332
sampling_time_s_sd: 0.000335
- cycle: 9
n_labeled: 450
mAP_mean: 0.474832
mAP_sd: 0.010245
aulc_mAP_mean: 0.408615
aulc_mAP_sd: 0.028394
annotation_cost: 175
sampling_time_s_mean: 0.003634
sampling_time_s_sd: 0.000646
- cycle: 10
n_labeled: 500
mAP_mean: 0.476957
mAP_sd: 0.008449
aulc_mAP_mean: 0.415343
aulc_mAP_sd: 0.026245
annotation_cost: 158
sampling_time_s_mean: 0.003644
sampling_time_s_sd: 0.000389
- cycle: 11
n_labeled: 550
mAP_mean: 0.477933
mAP_sd: 0.006941
aulc_mAP_mean: 0.420989
aulc_mAP_sd: 0.024222
annotation_cost: 167
sampling_time_s_mean: 0.003398
sampling_time_s_sd: 0.000395
- cycle: 12
n_labeled: 600
mAP_mean: 0.479481
mAP_sd: 0.009083
aulc_mAP_mean: 0.425799
aulc_mAP_sd: 0.022404
annotation_cost: 145
sampling_time_s_mean: 0.003368
sampling_time_s_sd: 0.00026
- cycle: 13
n_labeled: 650
mAP_mean: 0.482552
mAP_sd: 0.007105
aulc_mAP_mean: 0.430046
aulc_mAP_sd: 0.02098
annotation_cost: 152
sampling_time_s_mean: 0.003418
sampling_time_s_sd: 0.000288
- cycle: 14
n_labeled: 700
mAP_mean: 0.481459
mAP_sd: 0.009624
aulc_mAP_mean: 0.433758
aulc_mAP_sd: 0.01964
annotation_cost: 132
sampling_time_s_mean: 0.003404
sampling_time_s_sd: 0.000308
- cycle: 15
n_labeled: 750
mAP_mean: 0.482672
mAP_sd: 0.013172
aulc_mAP_mean: 0.436978
aulc_mAP_sd: 0.018296
annotation_cost: 124
sampling_time_s_mean: 0.003383
sampling_time_s_sd: 0.000296
supplementary:
n_cycles: 15
n_outer_repeats: 5
total_annotation_cost_mean: 2270.0
total_annotation_cost_sd: 127.54
total_sampling_time_s_mean: 0.05137
total_sampling_time_s_sd: 0.004374
computational_cost:
model_parameters: 2373128
epochs_per_cycle: 10
n_cycles: 15
cost_method: 355969200
baseline_n_cycles: 15
baseline_cost: 355969200
relative_cost: 1.0
Sampling Method:
- sampling.py: your sampling method. Follow the instructions for setting up a custom sampling method.
- requirements.txt (optional): any additional dependencies that were installed.
- method_description.ipynb (optional): example usage of your sampling method. Highly recommended if you are using a batch size scheduler.
Report (.pdf): final report (see submissions page for details)
Leaderboard
(Optional) During the development phase, a running leaderboard will be available on the BaseAL webpage. If you wish to participate, send the submission package (without the report) directly to Ben McEwen.
Rules
- Participants should not change the classification head (model.py) or core active learning loop (active_learner.py) within BaseAL.
- Dataset-specific sampling methods and configurations are not allowed; your method should generalise across datasets.
- Making use of dataset-specific metadata (e.g. time, location) is allowed; however, the sampling method must still run on datasets for which this metadata is not available.
- Sampling methods and configurations should be designed to be run on a single consumer-grade GPU. Please avoid small batch sizes and computationally expensive approaches that require HPC access. Submissions with significant computational overhead are not feasible for bioacoustic monitoring and will not be ranked.
- Labels are provided for oracle labelling only; do not use the labels in labels.csv for sampling.
Citation
If you use BaseAL or either of the curated datasets in your research, please cite as follows.
BaseAL
@software{mcewen_baseal,
author={McEwen, Ben and Zhang, Shiqi},
title={{BaseAL}: Active Learning Baseline},
year={2026},
version={v1.1.1},
publisher={Zenodo},
doi={10.5281/zenodo.18467564},
url={https://doi.org/10.5281/zenodo.18467564}
}
BioDCASE 2026 Task 4 - BirdSet Dataset
@dataset{rauch_2026_19191603,
author = {Rauch, Lukas and
Herde, Marek and
McEwen, Ben},
title = {BioDCASE 2026 Task 4: BirdSet Dataset},
month = mar,
year = 2026,
publisher = {Zenodo},
doi = {10.5281/zenodo.19191602},
url = {https://doi.org/10.5281/zenodo.19191602},
}
Please cite the original source as well.
BioDCASE 2026 Task 4 - ATBFL Dataset
@dataset{kurinchi_vendhan_2026_19133112,
author = {Kurinchi-Vendhan, Rupa and
Zhang, Shiqi and
McEwen, Ben},
title = {BioDCASE 2026 Task 4: ATBFL Dataset},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19133111},
url = {https://doi.org/10.5281/zenodo.19133111},
}
Please cite the original source as well.
Support
The Active Learning for Bioacoustics community Slack channel is now hosted within the new BioDCASE Community workspace. If you are not already a member, please join the workspace and then join the #task4-active-learning channel.
If you have specific questions that are not relevant to other participants, please email Ben McEwen.