Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
While existing speech language models exhibit strong semantic understanding, they often generate speech that lacks expressiveness, failing to bridge the gap between semantic intent and acoustic realization. This work proposes the Self-Aware Speech Language Model (SA-SLM), which introduces an intent-aware bridging mechanism and a rubric-based feedback strategy for realization-aware alignment, endowing the model with dual awareness of its internal semantic intent and its external acoustic expression. SA-SLM leverages a variational information bottleneck to extract temporally smooth intent representations and employs the model itself as a critic to align acoustic outputs with semantic intent. Trained on only 800 hours of expressive speech data, the 3B-parameter SA-SLM outperforms all open-source baselines on the EchoMind benchmark, trailing GPT-4o-Audio by merely 0.08 points in overall expressiveness.

📝 Abstract
Speech Language Models (SLMs) exhibit strong semantic understanding, yet their generated speech often sounds flat and fails to convey expressive intent, undermining user engagement. We term this mismatch the semantic understanding-acoustic realization gap. We attribute this gap to two key deficiencies: (1) intent transmission failure, where SLMs fail to provide the stable utterance-level intent needed for expressive delivery; and (2) realization-unaware training, where no feedback signal verifies whether acoustic outputs faithfully reflect intended expression. To address these issues, we propose SA-SLM (Self-Aware Speech Language Model), built on the principle that the model should be aware of what it thinks during generation and how it speaks during training. SA-SLM addresses this gap through two core contributions: (1) Intent-Aware Bridging, which uses a Variational Information Bottleneck (VIB) objective to translate the model's internal semantics into temporally smooth expressive intent, making speech generation aware of what the model intends to express; and (2) Realization-Aware Alignment, which repurposes the model as its own critic to verify and align acoustic realization with intended expressive intent via rubric-based feedback. Trained on only 800 hours of expressive speech data, our 3B-parameter SA-SLM surpasses all open-source baselines and comes within 0.08 points of GPT-4o-Audio in overall expressiveness on the EchoMind benchmark.
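
To make the Intent-Aware Bridging idea concrete, below is a minimal PyTorch sketch of a VIB layer over the SLM's hidden states, assuming states of shape (batch, time, dim). The class name, dimensions, mean-pooling step, and loss weighting are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VIBIntentEncoder(nn.Module):
    """Compress hidden states into a smooth utterance-level intent vector."""

    def __init__(self, hidden_dim: int = 768, intent_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, intent_dim)
        self.to_logvar = nn.Linear(hidden_dim, intent_dim)

    def forward(self, hidden_states: torch.Tensor):
        # Mean-pool over time: one simple way to obtain a stable,
        # temporally smooth utterance-level representation.
        pooled = hidden_states.mean(dim=1)            # (batch, hidden_dim)
        mu = self.to_mu(pooled)
        logvar = self.to_logvar(pooled)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL(q(z|x) || N(0, I)): the bottleneck term of the VIB objective
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()

# total_loss = generation_loss + beta * kl_term
# beta trades off compression (smoothness) against intent fidelity.
```

The KL penalty keeps the latent close to a standard normal prior, which discards token-level jitter and leaves a stable expressive-intent code for speech generation to condition on.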
Problem

Research questions and friction points this paper is trying to address.

Speech Language Models
expressive speech generation
semantic understanding-acoustic realization gap
intent transmission
realization-aware training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Aware Speech Language Model
Intent-Aware Bridging
Realization-Aware Alignment
Variational Information Bottleneck
Expressive Speech Generation
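
The Realization-Aware Alignment contribution listed above reuses the model as its own critic with rubric-based feedback. Below is a hedged sketch of how such a reward could be computed; the rubric wording, `Realization` fields, and `critic` interface are hypothetical placeholders rather than the paper's actual setup.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative rubric criteria; the paper's exact rubric is not shown here.
RUBRIC = [
    "Prosody matches the intended emotion",
    "Emphasis falls on semantically important words",
    "Overall delivery sounds natural and engaging",
]

@dataclass
class Realization:
    text: str                # the utterance being spoken
    intent: str              # e.g. "apologetic, gentle"
    audio_tokens: List[int]  # generated acoustic tokens

def realization_reward(
    critic: Callable[[str, List[int]], float],
    sample: Realization,
) -> float:
    """Mean rubric score from the SLM acting as its own critic."""
    scores = [
        critic(
            f"Text: {sample.text}\nIntent: {sample.intent}\n"
            f"Criterion: {c}\nRate the speech from 1 to 5.",
            sample.audio_tokens,
        )
        for c in RUBRIC
    ]
    return sum(scores) / len(scores)
```

The resulting scalar reward can then drive preference-ranking or policy-gradient fine-tuning, pushing acoustic realizations toward the expressive intent the model committed to during generation.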