Overview
A fundamental challenge across bioacoustics domains (terrestrial and marine) is the annotation of unlabelled data. Passive acoustic monitoring systems generate vast amounts of data, but only a small portion can be feasibly annotated by expert human annotators. Since model performance depends heavily on the quality and quantity of labelled data, this raises the following research question:
Given vast amounts of raw acoustic data and limited annotation resources, which data should be prioritised for labelling?
Active learning (AL) is a critical strategy for scaling bioacoustic monitoring. Rather than annotating data at random, AL iteratively selects the most informative samples for labelling, aiming to maximise model performance within a fixed annotation budget. Participants will design a sampling (acquisition) function that selects which unlabelled samples should be labelled at each AL cycle. Submissions will be evaluated on how efficiently their method improves classification performance across two bioacoustic domains: terrestrial (BirdSet) and marine (ATBFL).
Description
You are provided with a pool of "unlabelled" embeddings (labels are hidden), corresponding to audio segments. Your task is to select which samples should be labelled at each cycle to maximise classification performance within a fixed annotation budget.
The active learning loop proceeds as follows:
- A pool of pre-generated perch_v2 embeddings and a randomly initialised classification head are provided.
- Your sampling function selects a batch of samples from the unlabelled pool.
- Labels are automatically revealed for selected samples (oracle labelling).
- The model is retrained on all labelled samples so far.
- Steps 2–4 repeat until the annotation budget is exhausted.
(Optional) You can specify a warm-up sampling function and batch size for the first AL cycle.
Your sampling function receives the current model predictions and embeddings (and optionally metadata) and must return the indices of the next batch of samples to label. You are free to modify the warm-up strategy and batch size. The classification head (model.py) and core AL loop (active_learner.py) are fixed and may not be modified.
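As an illustration, a minimal sampling function might look like the following sketch. The exact signature is defined by BaseAL's sampling.py; the argument names and shapes here are assumptions, and the random strategy simply mirrors the Random baseline:

```python
import numpy as np

def random_sampling(predictions, embeddings, n_samples, metadata=None, rng=None):
    """Illustrative sampling function: pick a random batch from the unlabelled pool.

    predictions: (n_unlabelled, n_classes) model outputs for the unlabelled pool
    embeddings:  (n_unlabelled, dim) perch_v2 embeddings
    n_samples:   batch size to select this AL cycle
    Returns indices into the unlabelled pool.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_pool = embeddings.shape[0]
    # Select without replacement so no sample is labelled twice in one cycle.
    return rng.choice(n_pool, size=min(n_samples, n_pool), replace=False)
```

Your own acquisition function would replace the random choice with a score derived from the predictions, embeddings, or metadata, returning the indices of the samples you want labelled next.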
BaseAL
BaseAL is the evaluation framework for this task. It provides the AL pipeline, baseline sampling methods, and logging infrastructure. Participants implement their method by editing sampling.py within the framework.
Setup instructions are available here, and notebooks with example usage can be found in docs/. An interactive demo of BaseAL (using ESC10 data) is available.
Dataset
A limited annotation budget is a challenge that spans both terrestrial and marine domains. To evaluate the generalisability of methods across domains, both a terrestrial dataset (BirdSet) and a marine dataset (ATBFL) are provided.
These datasets have been curated for use with BaseAL. perch_v2 embeddings have been pre-generated for 5-second audio segments; these serve as the input for training the classification head. Submissions will be evaluated on held-out sets from BOTH datasets.
The max_budget for each data subset is 500 samples, which amounts to an average of 5% of the available training samples.
Structure
The data structure is consistent across both datasets:
subset
├── embeddings
│ ├── perch_v2
│ │ └── *.npy
├── labels.csv
├── metadata.csv
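Assuming the layout above, a subset can be loaded along these lines. This is a sketch: the function name is illustrative, and any file or column names beyond those shown in the tree are assumptions.

```python
from pathlib import Path

import numpy as np
import pandas as pd

def load_subset(subset_dir):
    """Load one dataset subset following the directory layout above."""
    subset_dir = Path(subset_dir)
    emb_dir = subset_dir / "embeddings" / "perch_v2"
    # One .npy file per 5-second segment; sort for a stable ordering.
    files = sorted(emb_dir.glob("*.npy"))
    embeddings = np.stack([np.load(f) for f in files])
    labels = pd.read_csv(subset_dir / "labels.csv")
    metadata = pd.read_csv(subset_dir / "metadata.csv")
    return embeddings, labels, metadata
```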
BaseAL also ships with ESC10 demo data; check the current config.yaml and adjust it as needed for the development set you are testing on.
See information for individual datasets and subsets below:
BirdSet (AL for Bioacoustics)
BirdSet is a large-scale benchmark dataset for avian bioacoustics, containing over 6,800 hours of recordings spanning nearly 10,000 species. It includes more than 400 hours across eight strongly-labelled evaluation subsets, each corresponding to a distinct geographic location, making it well-suited for evaluating model generalisation across soundscapes.
For this task, three of the eight evaluation subsets are provided: HSN, POW and UHH. Each subset contains pre-generated perch_v2 embeddings corresponding to 5-second multi-label segments. See the table below for per-subset details.
| Subset | Training segments | Validation segments | Classes | Av. labels per sample |
|---|---|---|---|---|
| HSN | 6600 | 1800 | 19 | 0.524 |
| POW | 2280 | 684 | 41 | 2.833 |
| UHH | 18319 | 7327 | 25 | 1.058 |
The BirdSet (AL for Bioacoustics) development set is available here:
ATBFL (AL for Bioacoustics)
This dataset is derived from the Acoustic Trends Blue Fin Library (ATBFL). ATBFL is one of the largest annotated datasets in marine bioacoustics, comprising blue and fin whale recordings collected around Antarctica from 2005 to 2017.
The data used in the AL for Bioacoustics task is derived from Task 2 (2025) development set, containing multi-label annotations across 7 call types from two species. Recordings are organised into site-year subsets (e.g. kerguelen2005, greenwich2015), each representing a distinct deployment. These subsets form the basis of the per-location evaluation. Pre-generated perch_v2 embeddings for 5-second segments are provided; see the table below for per-subset details.
| Subset | Training segments | Validation segments | Classes | Av. labels per sample |
|---|---|---|---|---|
| elephantisland2014 | 2197 | 1489 | 7 | 1.965 |
| elephantisland2013 | 1630 | 1092 | 7 | 2.650 |
| casey2014 | 1282 | 782 | 7 | 2.428 |
| kerguelen2014 | 1154 | 748 | 7 | 2.684 |
| kerguelen2015 | 729 | 479 | 7 | 2.839 |
| casey2017 | 613 | 405 | 6 | 1.858 |
| maudrise2014 | 526 | 353 | 7 | 1.288 |
| kerguelen2005 | 465 | 269 | 7 | 2.176 |
| ballenyislands2015 | 309 | 205 | 7 | 1.875 |
| greenwich2015 | 171 | 113 | 6 | 1.527 |
| rosssea2014 | 10 | 8 | 1 | 0.929 |
The task 4 ATBFL (AL for Bioacoustics) development set is available here, including additional information about dataset composition:
Baselines
The classification head (model) will be randomly initialised during evaluation. The sampling method will be evaluated across 5 independent runs and the average performance across all runs used for ranking.
- Random: samples are selected randomly, simulating no active learning being applied.
- Margin (link): selects samples where the model is least confident, based on the smallest gap between the top two predicted class probabilities. For multilabel data, uncertainty is aggregated across classes using the mean (Settles 2009).
- CoreSet (link): greedily selects samples that are farthest from the current labelled set in embedding space, maximising coverage without considering model uncertainty (Sener and Savarese 2018).
- TypiClust (link): typicality-based clustering selects the most representative unlabelled sample from each of the least-covered regions of the embedding space, balancing diversity with density-based sample quality (Hacohen et al. 2022).
Baseline configuration: a fixed batch size of 50 samples, with 10 epochs per AL cycle and a learning rate of 1e-3, is used for all baselines, up to the max_budget of 500 samples.
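The core of the margin baseline described above can be sketched as follows. This is a simplified version (the standard top-2 margin); BaseAL's multilabel aggregation may differ in detail:

```python
import numpy as np

def margin_sampling(probs, n_samples):
    """Select the n_samples with the smallest top-2 probability gap.

    probs: (n_unlabelled, n_classes) predicted class probabilities.
    Smaller margins indicate higher model uncertainty.
    """
    sorted_probs = np.sort(probs, axis=1)           # ascending per sample
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margins)[:n_samples]          # most uncertain first
```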
| Method | AULC (mAP macro) | Computational cost (relative) | Computational cost (wall-time (s)) | Annotation cost |
|---|---|---|---|---|
| Random | 0.401 | 1.0 | 0.00095 | 96.75 |
| Margin | 0.399 | 1.0 | 0.00249 | 125.50 |
| CoreSet | 0.42154 | 1.0 | 6.34688 | 108.50 |
| TypiClust | 0.39080 | 1.0 | 25.28352 | 98.25 |
The following section explains how these evaluation metrics are computed →
Evaluation
Ranking metric
Area Under the Learning Curve (AULC) for the mAP (macro) across a fixed budget of N samples. This metric will be aggregated across both datasets (and subsets) for the final ranking. Hence participants must consider how their sampling method generalises across domains.
AULC = (1 / N) * Σ mAP(n) for n = 1, ..., N
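Given the per-cycle mAP values from the learning curve, the AULC reduces to a running mean. A minimal sketch (BaseAL computes this internally and reports it per cycle as aulc_mAP_mean):

```python
def aulc(map_per_cycle):
    """Area under the learning curve: mean mAP across the N evaluated cycles."""
    return sum(map_per_cycle) / len(map_per_cycle)
```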
Supplementary Metrics
Additional metrics will be provided. These will not be used for overall ranking but will inform selection for the Jury Award.
Computational cost (training): the computational cost is defined as the product of the number of model parameters (model_parameters), training epochs per cycle (epochs), and number of active learning cycles within budget (AL_cycles), reported relative to the baseline configuration. This metric is therefore the relative increase (or decrease) in training cost.
cost = model_parameters * epochs * AL_cycles
relative_cost = cost_method / baseline_cost
Computational cost (sampling-cost): Time spent in the sampling step averaged across all AL cycles. Penalises computationally expensive sampling methods.
sampling_cost = (1 / T) * Σ sampling_time(t) for t = 1, ..., T
Annotation cost: active learning generally considers only the number of samples selected for annotation; however, the cost of annotating challenging samples is also important to consider. Here the annotation cost is the number of events per selected sample (e.g. the more multilabel events, the higher the cost).
annotation_cost = Σ events(i) for i in selected_samples
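The three supplementary cost metrics above can be computed directly from quantities already reported in the results .yaml. A sketch following the formulas above; the baseline constants are taken from the example submission (2,373,128 parameters, 10 epochs per cycle, 15 cycles):

```python
def relative_cost(model_parameters, epochs, al_cycles,
                  baseline_params=2373128, baseline_epochs=10, baseline_cycles=15):
    """Training cost relative to the baseline configuration."""
    cost = model_parameters * epochs * al_cycles
    return cost / (baseline_params * baseline_epochs * baseline_cycles)

def mean_sampling_cost(sampling_times):
    """Average wall-time spent in the sampling step per AL cycle."""
    return sum(sampling_times) / len(sampling_times)

def annotation_cost(events_per_sample):
    """Total number of labelled events across the selected samples."""
    return sum(events_per_sample)
```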
Sub-domain performance: Both BirdSet and ATBFL are composed of several subsets corresponding to different locations. Performance for each of these locations will be reported.
Submission
General BioDCASE instructions are available here. This specifies submission naming conventions and the final report template. Please note some important differences:
- Participants are required to include the edited sampling.py file for reproducibility and final ranking.
- Instead of submitting predictions, participants submit a .yaml file containing metrics (e.g. AULC, computational cost, etc.).
Deliverables
Please submit a .zip file containing the following information. Please follow the general BioDCASE naming convention.
Results ({method}_{dataset}_{lastname}.yaml): containing the performance metrics of your sampling method. Note that if you export these results using the BaseAL learner.export() method, they will already be in the correct format.
Please specify 5 repeats (n_outer_repeats: 5) when generating results; this will compute the mean and SD.
submission_timestamp: '2026-03-22T15:57:31'
author_lastname: baseline
institute_abbreviation: BASEAL
sampling_strategy: margin_multilabel
dataset: ATBFL_BASEAL
model: perch_v2
config:
learning_rate: 0.001
model_parameters: 2373128
n_outer_repeats: 5
pretrain_samples: 0
learning_curve:
- cycle: 1
n_labeled: 50
mAP_mean: 0.36654
mAP_sd: 0.066797
aulc_mAP_mean: 0.18327
aulc_mAP_sd: 0.033398
annotation_cost: 144
sampling_time_s_mean: 0.003362
sampling_time_s_sd: 0.000203
- cycle: 2
n_labeled: 100
mAP_mean: 0.384691
mAP_sd: 0.05046
aulc_mAP_mean: 0.279443
aulc_mAP_sd: 0.045983
annotation_cost: 131
sampling_time_s_mean: 0.003424
sampling_time_s_sd: 0.00042
- cycle: 3
n_labeled: 150
mAP_mean: 0.411882
mAP_sd: 0.047491
aulc_mAP_mean: 0.319057
aulc_mAP_sd: 0.045024
annotation_cost: 176
sampling_time_s_mean: 0.003471
sampling_time_s_sd: 0.000303
- cycle: 4
n_labeled: 200
mAP_mean: 0.431502
mAP_sd: 0.033669
aulc_mAP_mean: 0.344716
aulc_mAP_sd: 0.042434
annotation_cost: 177
sampling_time_s_mean: 0.003403
sampling_time_s_sd: 0.000249
- cycle: 5
n_labeled: 250
mAP_mean: 0.446552
mAP_sd: 0.02969
aulc_mAP_mean: 0.363578
aulc_mAP_sd: 0.039633
annotation_cost: 146
sampling_time_s_mean: 0.00334
sampling_time_s_sd: 0.000316
- cycle: 6
n_labeled: 300
mAP_mean: 0.463594
mAP_sd: 0.01702
aulc_mAP_mean: 0.378827
aulc_mAP_sd: 0.036091
annotation_cost: 202
sampling_time_s_mean: 0.00332
sampling_time_s_sd: 0.0003
- cycle: 7
n_labeled: 350
mAP_mean: 0.465332
mAP_sd: 0.019022
aulc_mAP_mean: 0.391061
aulc_mAP_sd: 0.032631
annotation_cost: 205
sampling_time_s_mean: 0.003468
sampling_time_s_sd: 0.000429
- cycle: 8
n_labeled: 400
mAP_mean: 0.47003
mAP_sd: 0.022713
aulc_mAP_mean: 0.400638
aulc_mAP_sd: 0.030427
annotation_cost: 156
sampling_time_s_mean: 0.003332
sampling_time_s_sd: 0.000335
- cycle: 9
n_labeled: 450
mAP_mean: 0.474832
mAP_sd: 0.010245
aulc_mAP_mean: 0.408615
aulc_mAP_sd: 0.028394
annotation_cost: 175
sampling_time_s_mean: 0.003634
sampling_time_s_sd: 0.000646
- cycle: 10
n_labeled: 500
mAP_mean: 0.476957
mAP_sd: 0.008449
aulc_mAP_mean: 0.415343
aulc_mAP_sd: 0.026245
annotation_cost: 158
sampling_time_s_mean: 0.003644
sampling_time_s_sd: 0.000389
- cycle: 11
n_labeled: 550
mAP_mean: 0.477933
mAP_sd: 0.006941
aulc_mAP_mean: 0.420989
aulc_mAP_sd: 0.024222
annotation_cost: 167
sampling_time_s_mean: 0.003398
sampling_time_s_sd: 0.000395
- cycle: 12
n_labeled: 600
mAP_mean: 0.479481
mAP_sd: 0.009083
aulc_mAP_mean: 0.425799
aulc_mAP_sd: 0.022404
annotation_cost: 145
sampling_time_s_mean: 0.003368
sampling_time_s_sd: 0.00026
- cycle: 13
n_labeled: 650
mAP_mean: 0.482552
mAP_sd: 0.007105
aulc_mAP_mean: 0.430046
aulc_mAP_sd: 0.02098
annotation_cost: 152
sampling_time_s_mean: 0.003418
sampling_time_s_sd: 0.000288
- cycle: 14
n_labeled: 700
mAP_mean: 0.481459
mAP_sd: 0.009624
aulc_mAP_mean: 0.433758
aulc_mAP_sd: 0.01964
annotation_cost: 132
sampling_time_s_mean: 0.003404
sampling_time_s_sd: 0.000308
- cycle: 15
n_labeled: 750
mAP_mean: 0.482672
mAP_sd: 0.013172
aulc_mAP_mean: 0.436978
aulc_mAP_sd: 0.018296
annotation_cost: 124
sampling_time_s_mean: 0.003383
sampling_time_s_sd: 0.000296
supplementary:
n_cycles: 15
n_outer_repeats: 5
total_annotation_cost_mean: 2270.0
total_annotation_cost_sd: 127.54
total_sampling_time_s_mean: 0.05137
total_sampling_time_s_sd: 0.004374
computational_cost:
model_parameters: 2373128
epochs_per_cycle: 10
n_cycles: 15
cost_method: 355969200
baseline_n_cycles: 15
baseline_cost: 355969200
relative_cost: 1.0
Sampling Method:
- sampling.py: your sampling method. Follow the instructions for setting up a custom sampling method.
- requirements.txt (optional): any additional dependencies that were installed.
- method_description.ipynb (optional): example usage of your sampling method. Highly recommended if you are using a batch size scheduler.
Report (.pdf): final report (see submissions page for details)
Leaderboard
(Optional) During the development phase, a running leaderboard will be available on the BaseAL webpage. If you wish to participate, send the submission package (without the report) directly to Ben McEwen.
Rules
- Participants should not change the classification head (model.py) or core active learning loop (active_learner.py) within BaseAL.
- Dataset-specific sampling methods and configurations are not allowed; your method should generalise across datasets.
- Making use of dataset-specific metadata (e.g. time, location) is allowed; however, the sampling method must still run on datasets for which this metadata is not available.
- Sampling methods and configurations should be designed to be run on a single consumer-grade GPU. Please avoid small batch sizes and computationally expensive approaches that require HPC access. Submissions with significant computational overhead are not feasible for bioacoustic monitoring and will not be ranked.
- Labels are provided for oracle labelling only; do not use the labels in labels.csv for sampling.
Citation
If you use BaseAL or either of the curated datasets in your research, please cite as follows.
BaseAL
@software{mcewen_baseal,
author={McEwen, Ben and Zhang, Shiqi},
title={{BaseAL}: Active Learning Baseline},
year={2026},
version={v1.1.1},
publisher={Zenodo},
doi={10.5281/zenodo.18467564},
url={https://doi.org/10.5281/zenodo.18467564}
}
BioDCASE 2026 Task 4 - BirdSet Dataset
@dataset{rauch_2026_19191603,
author = {Rauch, Lukas and
Herde, Marek and
McEwen, Ben},
title = {BioDCASE 2026 Task 4: BirdSet Dataset},
month = mar,
year = 2026,
publisher = {Zenodo},
doi = {10.5281/zenodo.19191602},
url = {https://doi.org/10.5281/zenodo.19191602},
}
Please cite the original source as well.
BioDCASE 2026 Task 4 - ATBFL Dataset
@dataset{kurinchi_vendhan_2026_19133112,
author = {Kurinchi-Vendhan, Rupa and
Zhang, Shiqi and
McEwen, Ben},
title = {BioDCASE 2026 Task 4: ATBFL Dataset},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19133111},
url = {https://doi.org/10.5281/zenodo.19133111},
}
Please cite the original source as well.
Support
The Active Learning for Bioacoustics community Slack channel is now hosted within the new BioDCASE Community workspace. If you are not already a member, please join the workspace and then join the #task4-active-learning channel.
If you have specific questions that are not relevant to other participants, please email Ben McEwen.