EP008 — When Birdsong Hears Elephants (Birdsong to Rumbles)
Can a model trained only on birdsong classify elephant calls — without any fine-tuning at all? A new paper from Geldenhuys and Niesler runs frozen-embedding transfer from bird-trained and speech-trained foundation models to African and Asian elephant calls, and gets within 2.2 percent of an end-to-end supervised baseline. Even more striking: the second layer of the network outperforms the final layer, and ten percent of the parameters do most of the work. Cross-domain parallel: the convergent evolution of vocal learning across songbirds, parrots, cetaceans, and elephants — different anatomies, different ecologies, similar architectural primitives.
Cross-domain connection
Convergent evolution of vocal learning across the four documented vocal-learning taxa: songbirds, parrots, cetaceans, and elephants. These groups evolved vocal learning independently, on different time scales, with very different anatomies (a syrinx in birds, a larynx in mammals) and under different ecological pressures, yet converged on similar architectural primitives: hierarchical, compositional structure of acoustic units and motor sequencing under auditory feedback. The analogy holds where what is shared is abstract architecture rather than surface conditions, in both biology and ML: in transformers, intermediate layers carry the abstract compositional structure while final layers specialize. It breaks on selection-pressure asymmetry (open-ended biological evolution over millions of generations versus a fixed loss function over weeks of training) and on substrate granularity (dedicated brain circuits versus homogeneous transformer parameters). Forward question: where does frozen-embedding transfer hit a ceiling that fine-tuning could break through, and what is in that 2.2 percent gap?
Concepts introduced
- Frozen embeddings (pretrained model as fixed feature extractor — no fine-tuning, no adapters)
- Foundation models for audio: Perch (birdsong-trained), wav2vec 2.0 and HuBERT (human-speech-trained)
- Out-of-species transfer (training-domain disjoint from deployment-domain)
- Infrasound (1–30 Hz, below human hearing) — the elephant rumble band
- Linear probing / lightweight downstream classifier as the thin learnable layer
- AUC as the classification quality metric (briefly named)
- Layerwise representation analysis — the empirical move of asking which layer's embedding works best
- Intermediate-layer dominance — the surprise that deeper isn't better for transfer
- Convergent evolution as a comparative-biology concept (briefly named, scoped to vocal learning)
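Several of the concepts above (frozen embeddings, linear probing, AUC, layerwise analysis) combine into one experimental recipe: extract each layer's embedding from a frozen backbone, train a thin linear classifier on top of each, and compare AUC across layers. A minimal sketch of that recipe, using synthetic arrays in place of real Perch or wav2vec 2.0 activations; the layer count, dimensions, and per-layer "separability" values are illustrative assumptions, rigged so an intermediate layer wins, mimicking the intermediate-layer dominance the episode describes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clips, dim = 400, 64

# Binary labels: 1 = elephant rumble, 0 = other sound (toy data).
labels = rng.integers(0, 2, n_clips)

# Synthetic "frozen embeddings": one (n_clips, dim) matrix per layer.
# Each layer mixes a class-dependent signal direction with noise; the
# separability values below are assumptions that make layer 1 (an
# intermediate layer) the most informative, not measured quantities.
separability = [0.05, 0.30, 0.15, 0.02]
layers = []
for s in separability:
    direction = rng.normal(size=dim)
    signal = np.outer(labels, direction) * s
    layers.append(signal + rng.normal(size=(n_clips, dim)))

# Linear probing: the backbone is never updated -- only this thin
# logistic-regression head is trained per layer, and AUC is the metric.
aucs = []
for i, X in enumerate(layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.5, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    aucs.append(auc)
    print(f"layer {i}: AUC = {auc:.3f}")

best = int(np.argmax(aucs))
print(f"best layer: {best}")
```

The point of the sketch is the shape of the experiment, not the numbers: the only trainable parameters are the probe's, so any AUC differences between layers reflect what the frozen representations already encode.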