🤖 AI Summary
Existing joint speech-text models face three key challenges: (1) a temporal resolution mismatch between speech tokens (25 Hz) and text tokens (~3 Hz), causing semantic dilution; (2) high computational overhead; and (3) catastrophic forgetting of the underlying large language model's (LLM's) knowledge during fine-tuning. To address these, we propose Dual-Resolution Speech Representations (DRSR) and Core-Cocktail Training, enabling, for the first time, simultaneous fine-grained speech modeling and coarse-grained textual semantic alignment. We design a shared-LLM architecture with a Speech Refined Head and a two-stage fusion fine-tuning strategy, augmented by multi-task Direct Preference Optimization (DPO) to enhance empathy and instruction following. The resulting models, Fun-Audio-Chat 8B and MoE 30B-A3B, achieve state-of-the-art performance on spoken question answering, audio understanding, and speech-based function calling at their respective scales. Fun-Audio-Chat-8B is publicly released with training and inference code, alongside a full-duplex interactive demo.
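The "two-stage fusion fine-tuning" above involves intermediate merging of checkpoints (detailed in the abstract below as Core-Cocktail Training). A minimal sketch of one plausible merging step, assuming simple linear interpolation of parameter dicts; the paper's actual recipe (interpolation weights, which tensors are merged) is not specified here:

```python
def merge_checkpoints(state_a, state_b, alpha=0.5):
    """Linearly interpolate two model state dicts: alpha*A + (1-alpha)*B.

    A plausible form of intermediate merging between a fine-tuned model
    and the original text LLM to curb catastrophic forgetting. The exact
    recipe used by the authors is an assumption here.
    """
    return {k: alpha * a + (1 - alpha) * state_b[k]
            for k, a in state_a.items()}

# Toy example with scalar "tensors" standing in for weight matrices.
a = {"w": 1.0, "b": 0.0}
b = {"w": 3.0, "b": 2.0}
print(merge_checkpoints(a, b))  # {'w': 2.0, 'b': 1.0}
```

The same function applies unchanged to real checkpoints whose values are tensors, since `*` and `+` broadcast elementwise.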
📝 Abstract
Recent advances in joint speech-text models show great potential for seamless voice interaction. However, existing models face critical challenges: the temporal resolution mismatch between speech tokens (25 Hz) and text tokens (~3 Hz) dilutes semantic information, incurs high computational cost, and causes catastrophic forgetting of the text LLM's knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model that addresses these limitations via two innovations from our previous work, DrVoice. First, Dual-Resolution Speech Representations (DRSR): the shared LLM processes audio at an efficient 5 Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25 Hz, balancing efficiency (a ~50% reduction in GPU cost) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning scheme with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction following, and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain the text LLM's knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs that require large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks, and achieve competitive-to-superior performance on Audio Understanding, Speech Function Calling, Instruction Following, and Voice Empathy. We also develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interaction. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
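The dual-resolution idea above (the shared LLM consumes 5 Hz grouped features while the Speech Refined Head emits tokens at 25 Hz) can be sketched as follows. This is a minimal illustration: mean-pooling groups of 5 frames is an assumed grouping operation, since the abstract does not specify whether grouping uses pooling, concatenation plus projection, or something else:

```python
import numpy as np

def group_tokens(speech_embs: np.ndarray, group: int = 5) -> np.ndarray:
    """Downsample 25 Hz speech-token embeddings to 5 Hz by grouping.

    speech_embs: (T, D) frame embeddings at 25 Hz.
    Returns (ceil(T / group), D) features at 5 Hz for the shared LLM.
    Mean-pooling is an assumed choice; the paper's grouping op may differ.
    """
    T, D = speech_embs.shape
    pad = (-T) % group  # right-pad so T divides evenly into groups
    if pad:
        speech_embs = np.concatenate(
            [speech_embs, np.zeros((pad, D), dtype=speech_embs.dtype)])
    return speech_embs.reshape(-1, group, D).mean(axis=1)

# 2 seconds of audio: 50 frames at 25 Hz -> 10 grouped frames at 5 Hz,
# a 5x shorter sequence for the LLM to attend over.
embs = np.random.randn(50, 256)
grouped = group_tokens(embs)
print(grouped.shape)  # (10, 256)
```

The 5x shorter sequence is what drives the reported GPU savings, while a separate head operating at the original 25 Hz preserves generation quality.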