EP008 — When Birdsong Hears Elephants (Birdsong to Rumbles)
Can a model trained only on birdsong classify elephant calls — without any fine-tuning at all? A new paper from Geldenhuys and Niesler runs frozen-embedding transfer from bird-trained and speech-trained foundation models to African and Asian elephant calls, and gets within 2.2 percent of an end-to-end supervised baseline. Even more striking: the second layer of the network outperforms the final layer, and ten percent of the parameters do most of the work. Cross-domain parallel: the convergent evolution of vocal learning across songbirds, parrots, cetaceans, and elephants — different anatomies, different ecologies, similar architectural primitives.
Cross-domain connection
Convergent evolution of vocal learning across the four documented vocal-learning taxa: songbirds, parrots, cetaceans, and elephants. These groups evolved vocal learning independently, on different time scales, with very different anatomies (a syrinx in birds, a larynx in mammals) and under different ecological pressures, yet converged on similar architectural primitives: hierarchical, compositional structure of acoustic units and motor sequencing under auditory feedback. The analogy holds where what is shared is abstract architecture rather than surface conditions, in both biology and ML: in transformers, intermediate layers carry the abstract compositional structure while final layers specialize. It breaks on selection-pressure asymmetry (open-ended biological evolution over millions of generations versus a fixed loss function over weeks of training) and on substrate granularity (dedicated brain circuits versus homogeneous transformer parameters). Forward question: where does frozen-embedding transfer hit a ceiling that fine-tuning could break through, and what is in that 2.2 percent gap?
Concepts introduced
- Frozen embeddings (pretrained model as fixed feature extractor — no fine-tuning, no adapters)
- Foundation models for audio: Perch (birdsong-trained), wav2vec 2.0 and HuBERT (human-speech-trained)
- Out-of-species transfer (training-domain disjoint from deployment-domain)
- Infrasound (1–30 Hz, below human hearing) — the elephant rumble band
- Linear probing / lightweight downstream classifier as the thin learnable layer
- AUC as the classification quality metric (briefly named)
- Layerwise representation analysis — the empirical move of asking which layer's embedding works best
- Intermediate-layer dominance — the surprise that deeper isn't better for transfer
- Convergent evolution as a comparative-biology concept (briefly named, scoped to vocal learning)
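Several of the concepts above (frozen embeddings, linear probing, AUC, layerwise analysis) combine into one experimental recipe: extract each layer's embedding from a frozen backbone, train a thin linear classifier on top of each, and compare AUC across layers. A minimal sketch of that recipe, using synthetic arrays in place of real Perch or wav2vec 2.0 activations; the layer count, dimensions, and per-layer "separability" values are illustrative assumptions, rigged so an intermediate layer wins, mimicking the intermediate-layer dominance the episode describes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clips, dim = 400, 64

# Binary labels: 1 = elephant rumble, 0 = other sound (toy data).
labels = rng.integers(0, 2, n_clips)

# Synthetic "frozen embeddings": one (n_clips, dim) matrix per layer.
# Each layer mixes a class-dependent signal direction with noise; the
# separability values below are assumptions that make layer 1 (an
# intermediate layer) the most informative, not measured quantities.
separability = [0.05, 0.30, 0.15, 0.02]
layers = []
for s in separability:
    direction = rng.normal(size=dim)
    signal = np.outer(labels, direction) * s
    layers.append(signal + rng.normal(size=(n_clips, dim)))

# Linear probing: the backbone is never updated -- only this thin
# logistic-regression head is trained per layer, and AUC is the metric.
aucs = []
for i, X in enumerate(layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.5, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    aucs.append(auc)
    print(f"layer {i}: AUC = {auc:.3f}")

best = int(np.argmax(aucs))
print(f"best layer: {best}")
```

The point of the sketch is the shape of the experiment, not the numbers: the only trainable parameters are the probe's, so any AUC differences between layers reflect what the frozen representations already encode.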