Exploring Reinforcement Learning for LLM-Based ASR and TTS Systems

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underexplored application of reinforcement learning (RL) to large language model (LLM)-based automatic speech recognition (ASR) and text-to-speech (TTS). The authors propose a lightweight RL framework for audio-based LLMs that can consume audio inputs and generate audio outputs. For ASR, they study rule-based reward functions within Group Relative Policy Optimization (GRPO) and examine how RL training data is constructed; for TTS, they compare GRPO with Differentiable Reward Optimization (DiffRO) and combine the two for further gains. Experiments show clear reductions in ASR word error rate and improvements in TTS naturalness and speaker similarity (MOS and SIM scores), even with limited labeled data and few optimization steps, supporting RL as a practical, reward-driven route to adapting LLM-based speech systems without extensive supervision.

📝 Abstract
In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.
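The abstract names GRPO with rule-based rewards for ASR but gives no formulas. A minimal sketch of the two ingredients, assuming the reward is negative word error rate and the "group" is a set of sampled transcripts for one utterance (`edit_distance`, `rule_based_reward`, and `grpo_advantages` are illustrative names, not the paper's code):

```python
import statistics

def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance via a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (0 if match)
            prev = cur
    return dp[-1]

def rule_based_reward(hypothesis: str, reference: str) -> float:
    """A rule-based ASR reward: negative word error rate."""
    ref, hyp = reference.split(), hypothesis.split()
    return -edit_distance(ref, hyp) / max(len(ref), 1)

def grpo_advantages(rewards: list) -> list:
    """GRPO scores each sample relative to its own group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]
```

For a group of sampled transcripts of one utterance, `grpo_advantages([rule_based_reward(h, ref) for h in group])` yields the per-sample weights a GRPO-style policy-gradient update would use; the actual reward rules and normalization used in the paper may differ.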
Problem

Research questions and friction points this paper is trying to address.

Applying reinforcement learning to LLM-based ASR and TTS systems
Addressing training complexity for audio-based language models
Evaluating RL effectiveness with limited data and optimization steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight RL framework for audio-based LLMs
GRPO with rule-based rewards for ASR optimization
Combining GRPO and DiffRO for TTS enhancement
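The DiffRO side of the last bullet is not elaborated on this page. Its core idea, as the name suggests, is a reward computed on the model's token posteriors, making the reward differentiable in the logits, unlike GRPO's reward on discrete samples. A toy sketch of that distinction (the `token_rewards` vector and all function names here are illustrative assumptions, not the paper's formulation):

```python
import math

def softmax(logits):
    """Posterior over candidate tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def differentiable_reward(logits, token_rewards):
    """Expected reward under the token posterior: sum_k p_k * r_k.
    Unlike a sampled reward, this is a smooth function of the logits."""
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, token_rewards))

def reward_grad(logits, token_rewards):
    """Closed-form gradient d E[r] / d logit_k = p_k * (r_k - E[r]),
    i.e. a signal that can flow straight back into the model."""
    probs = softmax(logits)
    er = sum(p * r for p, r in zip(probs, token_rewards))
    return [p * (r - er) for p, r in zip(probs, token_rewards)]
```

Gradient ascent on `differentiable_reward` shifts probability toward higher-reward tokens without sampling; combining it with GRPO, as the bullet suggests, could amount to summing this gradient with the group-relative policy-gradient term.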
Changfeng Gao
Speech Team, Tongyi Lab, Alibaba Group
Yabin Li
Speech Team, Tongyi Lab, Alibaba Group
Keyu An
Speech Team, Tongyi Lab, Alibaba Group
Zhifu Gao
Speech Team, Tongyi Lab, Alibaba Group
Zhihao Du
Alibaba
Speech separation · speech enhancement · speaker diarization
Han Zhao
Speech Team, Tongyi Lab, Alibaba Group
Xiangang Li
Unknown affiliation
Speech recognition · natural language processing