🤖 AI Summary
This work addresses the weak perceptual correlation and limited interpretability of conventional audio-visual speech enhancement (AVSE) objectives such as SI-SNR and MSE, which align poorly with human judgments of speech quality. To bridge this gap, the authors propose a reinforcement learning framework that uses a large audio language model to generate semantically rich natural language descriptions of the enhanced speech. A sentiment analysis model converts these descriptions into interpretable 1–5 numeric reward signals, which guide the fine-tuning of a pre-trained AVSE model with the Proximal Policy Optimization (PPO) algorithm. This is the first use of LLM-generated natural language feedback as a reinforcement learning reward in AVSE, substantially improving the alignment between the optimization objective and human perception. Evaluated on the AVSEC-4 dataset, the method outperforms both supervised baselines and a DNSMOS-based reinforcement learning baseline on objective metrics (PESQ, STOI, neural quality scores) and in subjective listening tests.
📄 Abstract
Existing Audio-Visual Speech Enhancement (AVSE) methods widely use objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE); however, these often correlate poorly with perceptual quality and offer limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of the enhanced speech, which a sentiment analysis model converts into a 1–5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes how the speech quality has improved. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline on PESQ, STOI, neural quality metrics, and subjective listening tests.
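The reward pipeline described above can be sketched minimally as follows. This is an illustrative skeleton, not the authors' implementation: `describe_speech` stands in for the audio LLM and `sentiment_to_score` for the sentiment analysis model (here a toy keyword heuristic), both hypothetical; only the overall flow (description → 1–5 score → scalar PPO reward) follows the paper.

```python
# Hypothetical sketch of the LLM-based reward pipeline: an audio LLM
# describes the enhanced speech, a sentiment model maps the description
# to a 1-5 score, and that score is the scalar PPO reward.

def describe_speech(enhanced_audio):
    # Placeholder for the audio LLM: in practice this returns a natural
    # language assessment of the enhanced speech signal.
    return "The speech is clear with only faint residual background noise."

def sentiment_to_score(description):
    # Placeholder for the sentiment analysis model: maps the description
    # to a rating clamped to [1, 5]. A keyword heuristic stands in here.
    text = description.lower()
    positives = sum(w in text for w in ("clear", "clean", "natural"))
    negatives = sum(w in text for w in ("noisy", "distorted", "muffled"))
    return max(1.0, min(5.0, 3.0 + positives - negatives))

def ppo_reward(enhanced_audio):
    # The 1-5 rating is used directly as the reward signal that PPO
    # maximizes while fine-tuning the pretrained AVSE model.
    return sentiment_to_score(describe_speech(enhanced_audio))

print(ppo_reward(None))
```

Because the reward is derived from a textual description rather than a waveform-level distance, the same pipeline also yields a human-readable explanation of each score, which is the interpretability benefit the paper emphasizes.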