Seeing What You Say: Expressive Image Generation from Speech

📅 2025-11-05

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses end-to-end speech-to-image generation: producing semantically accurate and emotionally consistent images directly from raw speech, without automatic speech recognition (ASR) as an intermediate step. The proposed framework, VoxStudio, features (i) a speech information bottleneck (SIB) module that compresses raw speech into compact semantic tokens while preserving prosody and emotional nuance; (ii) VoxEmoset, a large-scale paired emotional speech-image dataset synthesized with a TTS engine; and (iii) joint alignment of linguistic and paralinguistic information for image generation. Experiments on SpokenCOCO, Flickr8kAudio, and VoxEmoset demonstrate the feasibility of the approach and expose open challenges, notably emotional consistency and linguistic ambiguity. The work underlines the role of paralinguistic cues, particularly prosody and emotion, in speech-conditioned image generation.

📝 Abstract

This paper proposes VoxStudio, the first unified and end-to-end speech-to-image model that generates expressive images directly from spoken descriptions by jointly aligning linguistic and paralinguistic information. At its core is a speech information bottleneck (SIB) module, which compresses raw speech into compact semantic tokens, preserving prosody and emotional nuance. By operating directly on these tokens, VoxStudio eliminates the need for an additional speech-to-text system, which often ignores the hidden details beyond text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired emotional speech-image dataset built via an advanced TTS engine to affordably generate richly expressive utterances. Comprehensive experiments on the SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility of our method and highlight key challenges, including emotional consistency and linguistic ambiguity, paving the way for future research.
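
The abstract does not specify how the SIB module is built. As a rough illustration of the general idea of compressing a long sequence of speech frames into a few compact tokens, the sketch below pools frame features with dot-product attention against a small set of query vectors. Everything here is an assumption made for illustration: the function name `compress_frames`, the random stand-in features and queries, and the sizes `T`, `K`, and `DIM` are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_frames(frames, queries):
    """Pool T frame vectors into len(queries) tokens.

    Each token is an attention-weighted average of all frames, so the
    output length depends only on the number of queries, not on T.
    This is a generic attention-pooling bottleneck, NOT the paper's SIB.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        # Similarity of this query to every frame.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in frames]
        weights = softmax(scores)
        # Weighted average of the frames, one output dimension at a time.
        token = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        tokens.append(token)
    return tokens


# Hypothetical sizes: 200 speech frames, 8 output tokens, 16-dim features.
rng = random.Random(0)
T, K, DIM = 200, 8, 16
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(T)]   # stand-in speech features
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]  # stand-in learned queries
tokens = compress_frames(frames, queries)
print(len(frames), "frames ->", len(tokens), "tokens of dim", len(tokens[0]))
```

In a trained system the queries would be learned jointly with the image generator, so that the few surviving tokens carry whatever the generator needs, including cues such as tone or emotion that a transcript would discard.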

Problem

Research questions and friction points this paper is trying to address.

Generating expressive images directly from spoken descriptions without text conversion

Preserving emotional nuance and prosody through speech information bottleneck compression

Addressing emotional consistency and linguistic ambiguity in speech-to-image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates expressive images directly from speech

Uses a speech information bottleneck to preserve prosody and emotional nuance

Eliminates the need for a separate speech-to-text system

🔎 Similar Papers

EmoVOCA: Speech-Driven Emotional 3D Talking Heads