AI Summary
End-to-end speech large language models (LLMs) significantly underperform their text-only counterparts on complex tasks, and existing fine-tuning and reinforcement learning approaches struggle to close this performance gap. To address this limitation, this work proposes X-OPD, a cross-modal online policy distillation framework that, for the first time, enables a speech LLM to autonomously generate response trajectories while receiving token-level feedback from a text-based teacher model. This mechanism achieves fine-grained alignment of multimodal representations through online policy sampling, cross-modal knowledge distillation, and precise feedback signals. Evaluated across multiple benchmarks, X-OPD substantially narrows the performance gap between speech and text LLMs while preserving the speech model's inherent acoustic and paralinguistic capabilities.
Abstract
While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts, and standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, while a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap on complex tasks while preserving the model's inherent capabilities.
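The abstract does not spell out the form of the token-level feedback. A minimal sketch, assuming (hypothetically) that the teacher's feedback is a per-token reverse KL divergence between the student's on-policy next-token distribution and the teacher's distribution over the same rollout, might look like:

```python
import numpy as np

def token_level_kl_feedback(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) over one rollout.

    student_logits, teacher_logits: arrays of shape [T, V], where T is
    the number of generated tokens and V the shared vocabulary size.
    (Function name and loss choice are illustrative assumptions, not
    taken from the paper.)
    """
    def log_softmax(x):
        # Numerically stable log-softmax over the vocabulary axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    s = log_softmax(np.asarray(student_logits, dtype=float))
    t = log_softmax(np.asarray(teacher_logits, dtype=float))
    # Dense signal: one KL value per generated token, unlike a single
    # sequence-level reward as in typical RL fine-tuning.
    return (np.exp(s) * (s - t)).sum(axis=-1)  # shape [T]
```

Because the divergence is computed token by token on trajectories the student itself sampled, the training signal is dense and on-policy, which is the property the framework's "token-level feedback" refers to.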