Task 1: Multichannel Alignment


Task description

Coordinators

Aditya Bhattacharjee

Queen Mary University of London

Lisa Gill
Becky Heath

University of Cambridge

Gagan Narula

Researchers often deploy multiple audio recorders simultaneously, for example as passive autonomous recording units (ARUs) or embedded in animal-borne bio-loggers. Analysing sounds captured simultaneously by multiple recorders can provide insights into animal positions and numbers, as well as the dynamics of communication in groups. However, many of these devices are susceptible to desynchronization due to nonlinear clock drift, which diminishes researchers' ability to glean useful insights. A reliable post-processing resynchronization method would therefore increase the usability of the collected data.

In this challenge, participants will be presented with pairs of temporally desynchronized recordings and asked to design a system to synchronize them in time. In the development phase, participants are provided with audio pairs and a small set of ground-truth synchronization keypoints, such as those that could be produced through manual review of the data. In the evaluation phase, participants' systems are ranked by their ability to synchronize unseen audio pairs.

Description

Each dataset consists of a set of stereo audio files. The audio in the two channels of each file is not synchronized in time due to nonlinear clock drift. Each audio file has a corresponding set of annotations, \(k_0,\dots,k_{114}\), called keypoints. Each keypoint \(k_i=(k_{i,0}, k_{i,1})\) consists of a timestamp \(k_{i,0}\) for Channel 0 and a timestamp \(k_{i,1}\) for Channel 1. The two timestamps correspond to the same moment in the physical world, but because of clock drift they do not occur at the same position in the two recordings. The Channel 0 timestamps \(k_{i,0}\) occur at 1-second intervals. Timestamps always lie within the actual duration of the audio file, so some files have repeated timestamps at the beginning (to avoid negative timestamps) or at the end (to avoid exceeding the duration of the audio).
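For illustration, keypoints of this form can be parsed from an annotations file with the standard library alone. The column names follow the submission format described under Evaluation, and the values shown are invented:

```python
import csv
import io

# Hypothetical annotations.csv contents; columns follow the submission
# format (Filename, Time Channel 0, Time Channel 1). Values are
# illustrative only, not taken from the real datasets.
ANNOTATIONS_CSV = """Filename,Time Channel 0,Time Channel 1
rec_001.wav,0.0,0.31
rec_001.wav,1.0,1.33
rec_001.wav,2.0,2.36
"""

def load_keypoints(csv_text):
    """Parse keypoints into {filename: [(t_ch0, t_ch1), ...]}."""
    keypoints = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pair = (float(row["Time Channel 0"]), float(row["Time Channel 1"]))
        keypoints.setdefault(row["Filename"], []).append(pair)
    return keypoints

keypoints = load_keypoints(ANNOTATIONS_CSV)
```

Note how the Channel 0 timestamps advance in 1-second steps while the Channel 1 timestamps drift relative to them.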

Figure 1: Visualization of keypoint-based alignment.


During training, systems have access to keypoint timestamps in both Channel 0 and Channel 1. During inference, systems have access only to the Channel 0 timestamps and must predict the corresponding Channel 1 timestamps. Systems are evaluated on the mean error of their predicted Channel 1 timestamps relative to the ground-truth Channel 1 timestamps. The official validation code reports both mean squared error (MSE, in sec\(^2\)) and mean absolute error (MAE, in ms); in practice, MAE in milliseconds is often easier to interpret.
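The two metrics can be sketched in a few lines of Python. The official validation code is authoritative; this only illustrates the definitions and units:

```python
def mse_sec2(pred, truth):
    """Mean squared error in sec^2 over predicted Channel 1 timestamps."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def mae_ms(pred, truth):
    """Mean absolute error in milliseconds over predicted Channel 1 timestamps."""
    return 1000.0 * sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

truth = [0.31, 1.33, 2.36]  # ground-truth Channel 1 times (s), invented
pred = [0.30, 1.35, 2.30]   # predicted Channel 1 times (s), invented
print(mae_ms(pred, truth))  # ~30 ms mean error
print(mse_sec2(pred, truth))
```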

Audio dataset

The challenge uses two datasets: aru and zebra_finch. The train and validation (val) portions of these datasets, which include audio and ground-truth keypoints, can be found here. The test portion, which includes only audio, will be provided during the evaluation phase of BioDCASE 2026. The domain shift between train and validation sets reflects the domain shift between train and evaluation sets.

In both datasets, desynchronization includes a constant time shift between the two channels, as well as nonlinear clock drift within each file. However, the total desynchronization depends on the dataset.

  • The aru dataset follows a fine-grained regime: the total desynchronization never exceeds ±0.5 seconds.
  • The zebra_finch dataset follows a coarse regime: the total desynchronization never exceeds ±5 seconds.

The directory structure of the formatted datasets is:

formatted_data
├── aru
│   ├── train
│   │   ├── annotations.csv
│   │   └── audio
│   │       └── *.wav
│   └── val
│       ├── annotations.csv
│       └── audio
│           └── *.wav
└── zebra_finch
    ├── train
    │   ├── annotations.csv
    │   └── audio
    │       └── *.wav
    └── val
        ├── annotations.csv
        └── audio
            └── *.wav
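The layout above can be traversed with a few lines of standard-library Python. This is an illustrative sketch, not part of the official tooling:

```python
from pathlib import Path

def list_split_files(root, dataset, split):
    """Return (annotations_path, sorted wav paths) for one dataset split,
    following the formatted_data layout shown above.
    dataset is "aru" or "zebra_finch"; split is "train" or "val"."""
    split_dir = Path(root) / dataset / split
    annotations = split_dir / "annotations.csv"
    wavs = sorted((split_dir / "audio").glob("*.wav"))
    return annotations, wavs

# Example (assumes formatted_data/ has been downloaded locally):
# ann, wavs = list_split_files("formatted_data", "aru", "train")
```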

Evaluation and Baseline System

Code for model evaluation, baseline systems, and example usage can be found here.

Baselines included:

  • nosync, in which no synchronization is performed
  • gccphat, a signal-processing baseline based on GCC-PHAT
  • deeplearning, a deep learning system trained on the provided training data

For evaluation, model outputs are expected to be in the same format as the provided keypoint annotations, i.e. a .csv file with three columns: Filename, Time Channel 0, and Time Channel 1.
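A minimal sketch of writing predictions in this format with the standard csv module (the filenames and values are placeholders):

```python
import csv

def write_predictions(path, rows):
    """Write predictions in the expected submission format: one row per
    keypoint, columns Filename, Time Channel 0, Time Channel 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Filename", "Time Channel 0", "Time Channel 1"])
        writer.writerows(rows)

# Placeholder rows: (filename, channel 0 time, predicted channel 1 time)
write_predictions("predictions.csv",
                  [("rec_001.wav", 0.0, 0.31),
                   ("rec_001.wav", 1.0, 1.33)])
```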

The baseline repository contains separate training and inference scripts, dataset-specific hyperparameter files for aru and zebra_finch, and an example shell script showing how to reproduce the baseline runs. The deep learning baseline additionally requires a BEATs checkpoint, whereas the GCC-PHAT baseline can be run directly from the provided audio and annotations.

Deep learning baseline description

Overview

The deep learning system is a binary classifier trained to score whether a pair of mono audio windows, one from each channel, is aligned in time. Training is supervised using the annotated keypoints in the development data. At inference time, the model is used within an offset search procedure to predict the Channel 1 timestamp for each Channel 0 keypoint. The baseline is configured separately for the two datasets. For aru, the model operates in a fine-grained regime with a maximum error of 0.5 seconds. For zebra_finch, it operates in a coarse regime with a maximum error of 5.0 seconds. These settings reflect the differing alignment scales of the two datasets.

Technical details

All audio is resampled to 16 kHz. For each annotated keypoint, the training code extracts a local context window around the Channel 0 timestamp and constructs aligned or misaligned mono clip pairs from the two channels. For each window pair, features are extracted with a pre-trained BEATs encoder. The resulting embeddings are pooled in time and combined as \([z_0, z_1, |z_0-z_1|]\), then passed to a lightweight MLP head.
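A rough sketch of the pair-scoring head, using random arrays in place of real BEATs frame embeddings. The 768-dimensional embedding size and the hidden width of the head are assumptions; the baseline's actual layer sizes may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # assumed BEATs embedding dimension

def combine(z0, z1):
    """Mean-pool frame embeddings over time, then concatenate
    [z0, z1, |z0 - z1|] as the pair feature."""
    z0, z1 = z0.mean(axis=0), z1.mean(axis=0)
    return np.concatenate([z0, z1, np.abs(z0 - z1)])

def mlp_head(x, w1, b1, w2, b2):
    """Tiny MLP scoring head: one hidden ReLU layer, sigmoid output."""
    h = np.maximum(x @ w1 + b1, 0.0)
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

# Stand-in frame embeddings (time x dim), as an encoder might produce
# for the two channels' context windows.
z0 = rng.normal(size=(50, D))
z1 = rng.normal(size=(50, D))
x = combine(z0, z1)  # shape (3 * D,)

# Randomly initialized head weights, for illustration only.
w1 = rng.normal(size=(3 * D, 64)) * 0.01
b1 = np.zeros(64)
w2 = rng.normal(size=64) * 0.01
b2 = 0.0
score = mlp_head(x, w1, b1, w2, b2)  # alignment probability in (0, 1)
```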

Positive training pairs are formed from aligned keypoints. Negative pairs are generated by perturbing the Channel 1 start time by a sampled offset within the dataset-specific alignment range. The main objective is binary cross-entropy, with an optional auxiliary contrastive term that encourages stronger agreement among positive pairs within the batch.
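Negative-pair generation can be sketched as follows. The minimum offset used to keep negatives distinguishable from positives is an assumption, not taken from the task description:

```python
import random

def sample_negative_offset(max_offset_s, min_offset_s=0.05):
    """Sample a nonzero misalignment offset (seconds) for a negative pair.
    max_offset_s is the dataset-specific alignment range (0.5 for aru,
    5.0 for zebra_finch); min_offset_s keeps negatives distinguishable
    from positives (assumed value, not from the task description)."""
    magnitude = random.uniform(min_offset_s, max_offset_s)
    return magnitude if random.random() < 0.5 else -magnitude

def make_pair(t_ch0, t_ch1, positive, max_offset_s):
    """Return (channel0_start, channel1_start, label) for one training pair.
    Positive pairs use the annotated Channel 1 time; negative pairs
    perturb it by a sampled offset."""
    offset = 0.0 if positive else sample_negative_offset(max_offset_s)
    return t_ch0, t_ch1 + offset, int(positive)
```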

Inference uses an offset search procedure that scores candidate alignments with the trained classifier. In broad terms, the system first evaluates candidate offsets at a coarse level, then refines the final prediction according to the dataset-specific search strategy:

  • For a subset of keypoints, score candidate offsets on a coarse grid and select a smooth offset path using dynamic programming.
  • Interpolate this coarse path to all keypoints.
  • For each keypoint, refine by scoring a fine grid around the interpolated offset.
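The coarse dynamic-programming step can be sketched as a Viterbi-style search that trades classifier score against offset jumps between consecutive keypoints. The exact transition penalty used by the baseline is not specified; an absolute-difference penalty is assumed here:

```python
def smooth_offset_path(scores, offsets, smoothness=1.0):
    """Select one offset per keypoint, maximizing total classifier score
    minus a penalty on offset jumps between consecutive keypoints.

    scores[i][j]: classifier score for keypoint i at candidate offsets[j].
    Returns the chosen offset (seconds) for each keypoint.
    """
    n, m = len(scores), len(offsets)
    best = [scores[0][j] for j in range(m)]     # best path score ending at j
    back = [[0] * m for _ in range(n)]          # backpointers for decoding
    for i in range(1, n):
        new = [0.0] * m
        for j in range(m):
            # Score of extending each previous state k to state j,
            # penalizing large jumps between consecutive offsets.
            cands = [best[k] - smoothness * abs(offsets[j] - offsets[k])
                     for k in range(m)]
            k_best = max(range(m), key=lambda k: cands[k])
            back[i][j] = k_best
            new[j] = scores[i][j] + cands[k_best]
        best = new
    # Backtrack from the best final state to recover the path.
    j = max(range(m), key=lambda jj: best[jj])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [offsets[j] for j in path]
```

With a large smoothness weight, the path resists jumping to an isolated high-scoring offset at a single keypoint; with a small weight it follows the per-keypoint maxima.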

In the provided configurations, aru uses a local smooth-search setting with keypoint subsampling and refinement, while zebra_finch uses a broader global affine search with larger offset steps and no final refinement. This difference mirrors the fine versus coarse alignment regimes of the two datasets.

Results of baseline systems on validation set

We report both MAE (ms) and MSE (sec\(^2\)) on validation sets; lower is better and perfect alignment is achieved when both metrics equal \(0\).

Model          aru                          zebra_finch
               MAE (ms)   MSE (sec\(^2\))   MAE (ms)   MSE (sec\(^2\))
nosync         116.2      0.019             2297.4     5.734
gccphat        122.4      0.020             2457.0     7.995
deeplearning   164.3      0.040             757.1      1.218

We ran the baseline experiments with CUDA 13.0 on a single H100 GPU. Results obtained with different CUDA versions or GPU hardware are expected to be broadly similar, but may differ slightly.

External Data Resources

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

  • The external resource must be freely accessible to any other research group in the world, and must be public and freely available before April 1st, 2026.
  • The list of external resources used in training must be clearly indicated in the technical report.

Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

  • Participants may submit predictions for up to three systems.
  • The models' predictions on both aru and zebra_finch test datasets must be included in the submission.

Tools

Baseline systems and official evaluation code can be found here. The deeplearning baseline requires weights for the BEATs feature extractor, which can be obtained here.

Citation

If you use the provided data in a publication, please cite the DOI available here. If you use the provided code in a publication, please follow the citation format provided in the GitHub repository.

Support

If you have questions, please use the BioDCASE Google Groups community forum or contact the task organizers at abhattacharjee@qmul.ac.uk.