X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

πŸ“… 2026-03-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
End-to-end speech large language models (LLMs) significantly underperform their text-only counterparts on complex tasks, and existing fine-tuning and reinforcement learning approaches struggle to bridge this performance gap. To address this limitation, this work proposes X-OPD, a cross-modal on-policy distillation framework that, for the first time, enables speech LLMs to autonomously generate response trajectories while receiving token-level feedback from a text-based teacher model. This mechanism facilitates fine-grained alignment of multimodal representations through on-policy sampling, cross-modal knowledge distillation, and precise feedback signals. Evaluated across multiple benchmarks, X-OPD substantially narrows the performance gap between speech and text LLMs while preserving the speech model's inherent acoustic and paralinguistic capabilities.
πŸ“ Abstract
While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. Standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multimodal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap on complex tasks while preserving the model's inherent capabilities.
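The core mechanism described above — the student samples its own trajectories and the teacher scores each token — can be sketched as a per-token reverse KL objective evaluated on student-sampled sequences. The sketch below is illustrative only, assuming the paper's notion of "token-level feedback" corresponds to comparing teacher and student next-token distributions along the rollout; the function names and the NumPy formulation are our own, not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(student_logits, teacher_logits):
    """Token-level reverse KL, KL(student || teacher), averaged over a
    trajectory sampled from the student.

    student_logits, teacher_logits: arrays of shape [T, V] — logits at each
    of T steps of the student's own rollout (hypothetical interface)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # Per-token feedback signal: how far the student's distribution drifts
    # from the teacher's at each position of the sampled trajectory.
    kl_per_token = (p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12))).sum(axis=-1)
    return kl_per_token.mean()

# Toy example: a 4-step trajectory over a 5-token vocabulary.
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 5))           # student logits along its own rollout
t = s + rng.normal(size=(4, 5))       # teacher logits, perturbed for illustration
print(on_policy_distill_loss(s, s))   # identical distributions -> 0
print(on_policy_distill_loss(s, t))   # mismatch -> positive loss
```

Because the loss is computed on trajectories the student itself generated (on-policy), the teacher's feedback lands exactly on the states the student actually visits, rather than on teacher-forced text as in standard off-policy distillation.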
Problem

Research questions and friction points this paper is trying to address.

Speech LLMs
capability alignment
performance gap
end-to-end models
text-based counterparts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Distillation
On-Policy Learning
Speech LLMs
Capability Alignment
Token-level Feedback