Task 1: Multichannel Alignment


Task description

Coordinators

Aditya Bhattacharjee

Queen Mary University of London

Lisa Gill
Becky Heath

University of Cambridge

Gagan Narula

Researchers often deploy multiple audio recorders simultaneously, for example as passive autonomous recording units (ARUs) or embedded in animal-borne bio-loggers. Analysing sounds captured simultaneously by multiple recorders can provide insights into animal positions and numbers, as well as the dynamics of communication in groups. However, many of these devices are susceptible to desynchronization due to nonlinear clock drift, which diminishes researchers' ability to glean useful insights. A reliable post-processing resynchronization method would therefore increase the usability of the collected data.

In this challenge, participants will be presented with pairs of temporally desynchronized recordings and asked to design a system to synchronize them in time. In the development phase, participants are provided with audio pairs and a small set of ground-truth synchronization keypoints, such as those that could be produced through manual review of the data. In the evaluation phase, participants' systems are ranked by their ability to synchronize unseen audio pairs.

Description

Each dataset consists of a set of stereo audio files. The audio in the two channels of each file is not synchronized in time due to nonlinear clock drift. Each audio file has a corresponding set of annotations, \(k_0,\dots,k_{114}\), called keypoints. Each keypoint \(k_i=(k_{i,0}, k_{i,1})\) consists of a timestamp \(k_{i,0}\) for Channel 0 and a timestamp \(k_{i,1}\) for Channel 1. The two timestamps correspond to the same moment in the physical world, but because of clock drift they do not occur at the same position in the two recordings. The Channel 0 timestamps \(k_{i,0}\) occur at 1-second intervals. Timestamps always lie within the actual duration of the audio file, so some files have repeated timestamps at the beginning (to avoid negative timestamps) or at the end (to avoid exceeding the duration of the audio).
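For illustration, keypoints of this form can be parsed from an annotations file with the standard library alone. The column names follow the submission format described under Evaluation, and the values shown are invented:

```python
import csv
import io

# Hypothetical annotations.csv contents; columns follow the submission
# format (Filename, Time Channel 0, Time Channel 1). Values are
# illustrative only, not taken from the real datasets.
ANNOTATIONS_CSV = """Filename,Time Channel 0,Time Channel 1
rec_001.wav,0.0,0.31
rec_001.wav,1.0,1.33
rec_001.wav,2.0,2.36
"""

def load_keypoints(csv_text):
    """Parse keypoints into {filename: [(t_ch0, t_ch1), ...]}."""
    keypoints = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pair = (float(row["Time Channel 0"]), float(row["Time Channel 1"]))
        keypoints.setdefault(row["Filename"], []).append(pair)
    return keypoints

keypoints = load_keypoints(ANNOTATIONS_CSV)
```

Note how the Channel 0 timestamps advance in 1-second steps while the Channel 1 timestamps drift relative to them.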

Figure 1: Visualization of keypoint-based alignment.


During training, systems have access to keypoint timestamps in both Channel 0 and Channel 1. During inference, systems have access only to the Channel 0 timestamps and must predict the corresponding Channel 1 timestamps. Systems are evaluated on the mean error of their predicted Channel 1 timestamps relative to the ground-truth Channel 1 timestamps. The official validation code reports both mean squared error (MSE, in sec\(^2\)) and mean absolute error (MAE, in ms); in practice, MAE in milliseconds is often easier to interpret.
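The two metrics can be sketched in a few lines of Python. The official validation code is authoritative; this only illustrates the definitions and units:

```python
def mse_sec2(pred, truth):
    """Mean squared error in sec^2 over predicted Channel 1 timestamps."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def mae_ms(pred, truth):
    """Mean absolute error in milliseconds over predicted Channel 1 timestamps."""
    return 1000.0 * sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

truth = [0.31, 1.33, 2.36]  # ground-truth Channel 1 times (s), invented
pred = [0.30, 1.35, 2.30]   # predicted Channel 1 times (s), invented
print(mae_ms(pred, truth))  # ~30 ms mean error
print(mse_sec2(pred, truth))
```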

Audio dataset

The challenge uses two datasets: aru and zebra_finch. The train and validation (val) portions of these datasets, which include audio and ground-truth keypoints, can be found here. The test portion, which includes only audio, will be provided during the evaluation phase of BioDCASE 2026. The domain shift between train and validation sets reflects the domain shift between train and evaluation sets.

In both datasets, desynchronization includes a constant time shift between the two channels, as well as nonlinear clock drift within each file. However, the total desynchronization depends on the dataset.

  • The aru dataset follows a fine-grained regime: the total desynchronization never exceeds ±0.5 seconds.
  • The zebra_finch dataset follows a coarse regime: the total desynchronization never exceeds ±5 seconds.

The directory structure of the formatted datasets is:

formatted_data
├── aru
│   ├── train
│   │   ├── annotations.csv
│   │   └── audio
│   │       └── *.wav
│   └── val
│       ├── annotations.csv
│       └── audio
│           └── *.wav
└── zebra_finch
    ├── train
    │   ├── annotations.csv
    │   └── audio
    │       └── *.wav
    └── val
        ├── annotations.csv
        └── audio
            └── *.wav
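The layout above can be traversed with a few lines of standard-library Python. This is an illustrative sketch, not part of the official tooling:

```python
from pathlib import Path

def list_split_files(root, dataset, split):
    """Return (annotations_path, sorted wav paths) for one dataset split,
    following the formatted_data layout shown above.
    dataset is "aru" or "zebra_finch"; split is "train" or "val"."""
    split_dir = Path(root) / dataset / split
    annotations = split_dir / "annotations.csv"
    wavs = sorted((split_dir / "audio").glob("*.wav"))
    return annotations, wavs

# Example (assumes formatted_data/ has been downloaded locally):
# ann, wavs = list_split_files("formatted_data", "aru", "train")
```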

Evaluation and Baseline System

Code for model evaluation, baseline systems, and example usage can be found here.

Baselines included:

  • nosync, in which no synchronization is performed
  • gccphat, a signal-processing baseline based on GCC-PHAT
  • deeplearning, a deep learning system trained on the provided training data

For evaluation, model outputs are expected to be in the same format as the provided keypoint annotations, i.e. a .csv file with three columns: Filename, Time Channel 0, and Time Channel 1.
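A minimal sketch of writing predictions in this format with the standard csv module (the filenames and values are placeholders):

```python
import csv

def write_predictions(path, rows):
    """Write predictions in the expected submission format: one row per
    keypoint, columns Filename, Time Channel 0, Time Channel 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Filename", "Time Channel 0", "Time Channel 1"])
        writer.writerows(rows)

# Placeholder rows: (filename, channel 0 time, predicted channel 1 time)
write_predictions("predictions.csv",
                  [("rec_001.wav", 0.0, 0.31),
                   ("rec_001.wav", 1.0, 1.33)])
```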

The baseline repository contains separate training and inference scripts, dataset-specific hyperparameter files for aru and zebra_finch, and an example shell script showing how to reproduce the baseline runs. The deep learning baseline additionally requires a BEATs checkpoint, whereas the GCC-PHAT baseline can be run directly from the provided audio and annotations.

Deep learning baseline description

Overview

The deep learning system is a binary classifier trained to score whether a pair of mono audio windows, one from each channel, is aligned in time. Training is supervised using the annotated keypoints in the development data. At inference time, the model is used within an offset search procedure to predict the Channel 1 timestamp for each Channel 0 keypoint. The baseline is configured separately for the two datasets. For aru, the model operates in a fine-grained regime with a maximum error of 0.5 seconds. For zebra_finch, it operates in a coarse regime with a maximum error of 5.0 seconds. These settings reflect the differing alignment scales of the two datasets.

Technical details

All audio is resampled to 16 kHz. For each annotated keypoint, the training code extracts a local context window around the Channel 0 timestamp and constructs aligned or misaligned mono clip pairs from the two channels. For each window pair, features are extracted with a pre-trained BEATs encoder. The resulting embeddings are pooled in time and combined as \([z_0, z_1, |z_0-z_1|]\), then passed to a lightweight MLP head.
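A rough sketch of the pair-scoring head, using random arrays in place of real BEATs frame embeddings. The 768-dimensional embedding size and the hidden width of the head are assumptions; the baseline's actual layer sizes may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # assumed BEATs embedding dimension

def combine(z0, z1):
    """Mean-pool frame embeddings over time, then concatenate
    [z0, z1, |z0 - z1|] as the pair feature."""
    z0, z1 = z0.mean(axis=0), z1.mean(axis=0)
    return np.concatenate([z0, z1, np.abs(z0 - z1)])

def mlp_head(x, w1, b1, w2, b2):
    """Tiny MLP scoring head: one hidden ReLU layer, sigmoid output."""
    h = np.maximum(x @ w1 + b1, 0.0)
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

# Stand-in frame embeddings (time x dim), as an encoder might produce
# for the two channels' context windows.
z0 = rng.normal(size=(50, D))
z1 = rng.normal(size=(50, D))
x = combine(z0, z1)  # shape (3 * D,)

# Randomly initialized head weights, for illustration only.
w1 = rng.normal(size=(3 * D, 64)) * 0.01
b1 = np.zeros(64)
w2 = rng.normal(size=64) * 0.01
b2 = 0.0
score = mlp_head(x, w1, b1, w2, b2)  # alignment probability in (0, 1)
```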

Positive training pairs are formed from aligned keypoints. Negative pairs are generated by perturbing the Channel 1 start time by a sampled offset within the dataset-specific alignment range. The main objective is binary cross-entropy, with an optional auxiliary contrastive term that encourages stronger agreement among positive pairs within the batch.
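Negative-pair generation can be sketched as follows. The minimum offset used to keep negatives distinguishable from positives is an assumption, not taken from the task description:

```python
import random

def sample_negative_offset(max_offset_s, min_offset_s=0.05):
    """Sample a nonzero misalignment offset (seconds) for a negative pair.
    max_offset_s is the dataset-specific alignment range (0.5 for aru,
    5.0 for zebra_finch); min_offset_s keeps negatives distinguishable
    from positives (assumed value, not from the task description)."""
    magnitude = random.uniform(min_offset_s, max_offset_s)
    return magnitude if random.random() < 0.5 else -magnitude

def make_pair(t_ch0, t_ch1, positive, max_offset_s):
    """Return (channel0_start, channel1_start, label) for one training pair.
    Positive pairs use the annotated Channel 1 time; negative pairs
    perturb it by a sampled offset."""
    offset = 0.0 if positive else sample_negative_offset(max_offset_s)
    return t_ch0, t_ch1 + offset, int(positive)
```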

Inference uses an offset search procedure that scores candidate alignments with the trained classifier. In broad terms, the system first evaluates candidate offsets at a coarse level, then refines the final prediction according to the dataset-specific search strategy:

  • For a subset of keypoints, score candidate offsets on a coarse grid and select a smooth offset path using dynamic programming.
  • Interpolate this coarse path to all keypoints.
  • For each keypoint, refine by scoring a fine grid around the interpolated offset.
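The coarse dynamic-programming step can be sketched as a Viterbi-style search that trades classifier score against offset jumps between consecutive keypoints. The exact transition penalty used by the baseline is not specified; an absolute-difference penalty is assumed here:

```python
def smooth_offset_path(scores, offsets, smoothness=1.0):
    """Select one offset per keypoint, maximizing total classifier score
    minus a penalty on offset jumps between consecutive keypoints.

    scores[i][j]: classifier score for keypoint i at candidate offsets[j].
    Returns the chosen offset (seconds) for each keypoint.
    """
    n, m = len(scores), len(offsets)
    best = [scores[0][j] for j in range(m)]     # best path score ending at j
    back = [[0] * m for _ in range(n)]          # backpointers for decoding
    for i in range(1, n):
        new = [0.0] * m
        for j in range(m):
            # Score of extending each previous state k to state j,
            # penalizing large jumps between consecutive offsets.
            cands = [best[k] - smoothness * abs(offsets[j] - offsets[k])
                     for k in range(m)]
            k_best = max(range(m), key=lambda k: cands[k])
            back[i][j] = k_best
            new[j] = scores[i][j] + cands[k_best]
        best = new
    # Backtrack from the best final state to recover the path.
    j = max(range(m), key=lambda jj: best[jj])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [offsets[j] for j in path]
```

With a large smoothness weight, the path resists jumping to an isolated high-scoring offset at a single keypoint; with a small weight it follows the per-keypoint maxima.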

In the provided configurations, aru uses a local smooth-search setting with keypoint subsampling and refinement, while zebra_finch uses a broader global affine search with larger offset steps and no final refinement. This difference mirrors the fine versus coarse alignment regimes of the two datasets.

Results of baseline systems on validation set

We report both MAE (ms) and MSE (sec\(^2\)) on validation sets; lower is better and perfect alignment is achieved when both metrics equal \(0\).

Model          aru                          zebra_finch
               MAE (ms)   MSE (sec\(^2\))   MAE (ms)   MSE (sec\(^2\))
nosync         116.2      0.019             2297.4     5.734
gccphat        122.4      0.020             2457.0     7.995
deeplearning   164.3      0.040             757.1      1.218

We ran the baseline experiments with CUDA 13.0 on a single H100 GPU. Results obtained with different CUDA versions or GPU hardware are expected to be broadly similar, but may differ slightly.

External Data Resources

The use of external resources (data sets, pretrained models) is allowed under the following conditions:

  • The external resource must be freely accessible to any other research group in the world, and must be public and freely available before April 1st, 2026.
  • The list of external resources used in training must be clearly indicated in the technical report.

Task Rules

There are general rules valid for all tasks; these, along with information on technical report and submission requirements, can be found here.

Task-specific rules:

  • Participants may submit predictions for up to three systems.
  • The models' predictions on both aru and zebra_finch test datasets must be included in the submission.

Tools

Baseline systems and official evaluation code can be found here. The deeplearning baseline requires weights for the BEATs feature extractor, which can be obtained here.

Citation

If you use the provided data in a publication, please cite the DOI available here. If you use the provided code in a publication, please follow the citation format provided in the GitHub repository.

Support

If you have questions, please use the BioDCASE Google Groups community forum or contact the task organizers at abhattacharjee@qmul.ac.uk.