🤖 AI Summary
This work addresses the weak perceptual correlation and limited interpretability of conventional audio-visual speech enhancement (AVSE) objectives such as SI-SNR and MSE, which align poorly with human judgments of speech quality. To bridge this gap, the authors propose a reinforcement learning framework that uses a large audio language model to generate semantically rich natural language descriptions of the enhanced speech. A sentiment analysis model converts these descriptions into interpretable 1–5 numeric reward signals, which guide the fine-tuning of a pre-trained AVSE model with the Proximal Policy Optimization (PPO) algorithm. This is the first use of LLM-generated natural language feedback as a reinforcement learning reward in AVSE, substantially improving the alignment between the optimization objective and human perception. Evaluated on the AVSEC-4 dataset, the method outperforms both supervised baselines and a DNSMOS-based reinforcement learning baseline on objective metrics (PESQ, STOI, neural quality scores) and in subjective listening tests.
📄 Abstract
Existing Audio-Visual Speech Enhancement (AVSE) methods widely use objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE); however, these often correlate poorly with perceptual quality and offer limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of the enhanced speech, which a sentiment analysis model converts into a 1–5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes how the speech quality has improved. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline on PESQ, STOI, neural quality metrics, and subjective listening tests.
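The reward pipeline described above can be sketched minimally as follows. This is an illustrative skeleton, not the authors' implementation: `describe_speech` stands in for the audio LLM and `sentiment_to_score` for the sentiment analysis model (here a toy keyword heuristic), both hypothetical; only the overall flow (description → 1–5 score → scalar PPO reward) follows the paper.

```python
# Hypothetical sketch of the LLM-based reward pipeline: an audio LLM
# describes the enhanced speech, a sentiment model maps the description
# to a 1-5 score, and that score is the scalar PPO reward.

def describe_speech(enhanced_audio):
    # Placeholder for the audio LLM: in practice this returns a natural
    # language assessment of the enhanced speech signal.
    return "The speech is clear with only faint residual background noise."

def sentiment_to_score(description):
    # Placeholder for the sentiment analysis model: maps the description
    # to a rating clamped to [1, 5]. A keyword heuristic stands in here.
    text = description.lower()
    positives = sum(w in text for w in ("clear", "clean", "natural"))
    negatives = sum(w in text for w in ("noisy", "distorted", "muffled"))
    return max(1.0, min(5.0, 3.0 + positives - negatives))

def ppo_reward(enhanced_audio):
    # The 1-5 rating is used directly as the reward signal that PPO
    # maximizes while fine-tuning the pretrained AVSE model.
    return sentiment_to_score(describe_speech(enhanced_audio))

print(ppo_reward(None))
```

Because the reward is derived from a textual description rather than a waveform-level distance, the same pipeline also yields a human-readable explanation of each score, which is the interpretability benefit the paper emphasizes.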