🤖 AI Summary
This work proposes Bagpiper, an 8-billion-parameter universal audio foundation model that overcomes the limitations of existing approaches reliant on task-specific supervision. By leveraging "rich captions" imbued with cognitive semantics, Bagpiper establishes bidirectional mappings between raw audio and high-level semantic representations, enabling unified understanding and generation across speech, music, and sound effects. Adopting a caption-first processing paradigm, the model requires no task-specific priors and supports open-ended audio tasks. Trained on a massive corpus of 600 billion tokens and refined through caption-guided fine-tuning, Bagpiper achieves state-of-the-art performance on the MMAU and AIRBench benchmarks, outperforming Qwen-2.5-Omni on comprehension tasks and surpassing CosyVoice3 and TangoFlux in generation quality. It is among the first models to unify audio understanding and arbitrary compositional generation within a single architecture, breaking down traditional task-isolation barriers.
📝 Abstract
Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated facets of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding, and surpasses CosyVoice3 and TangoFlux in generation quality, synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.