Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

📅 2026-05-24
🤖 AI Summary
Existing open-source diffusion models struggle to produce stable, finely synchronized audio-visual content for semantically complex scenes, primarily because they rely on coarse text embeddings and lack a shared long-horizon plan. To address this, the work proposes Baton, which its authors describe as the first framework to introduce explicit semantic planning into joint video-audio generation. Baton's VA-Planner, a multimodal language model with dual semantic alignment towers, jointly reasons about and mutually aligns video and audio planned tokens before denoising; these tokens serve as keyframe-level blueprints and are injected into the diffusion backbone through cross-attention. A Relative Semantic RoPE then maps planned tokens and diffusion latents into a shared spatial-temporal coordinate frame so that each latent attends to its positionally corresponding semantic cues. The abstract reports that experiments on benchmarks show Baton's effectiveness both qualitatively and quantitatively.
📝 Abstract
Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.
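The query mechanism described in the abstract, where learnable queries cross-attend to both video and audio features to yield a pair of planned tokens, can be pictured with a small NumPy sketch. Everything here is an illustrative assumption rather than the paper's implementation: the feature shapes, the use of one query set per tower, the concatenation of video and audio features into one context, and the omission of learned projections and of the multimodal language model itself. The names `cross_attend`, `video_queries`, and `audio_queries` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d))
    return weights @ features, weights

rng = np.random.default_rng(0)
dim = 16
video_feats = rng.normal(size=(32, dim))  # hypothetical video features
audio_feats = rng.normal(size=(20, dim))  # hypothetical audio features

# Random stand-ins for the learnable queries of each tower (assumed: 4 per tower).
video_queries = rng.normal(size=(4, dim))
audio_queries = rng.normal(size=(4, dim))

# Assumption: each tower attends to the concatenation of both modalities.
context = np.concatenate([video_feats, audio_feats], axis=0)
video_plan, video_weights = cross_attend(video_queries, context)
audio_plan, audio_weights = cross_attend(audio_queries, context)
```

In this toy setup each planned token is a weighted mixture of features from both modalities, which is one simple way to read "jointly reasoned and mutually aligned"; the actual alignment objective is not specified in the abstract.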
Problem

Research questions and friction points this paper is trying to address.

joint video-audio generation
semantic planning
cross-modal alignment
diffusion models
fine-grained semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic planning
joint video-audio generation
planned tokens
cross-modal alignment
Relative Semantic RoPE
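
Relative Semantic RoPE is described in the abstract as a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame. A minimal sketch of that idea, restricted to the temporal axis, is given below. It uses standard rotary position embedding and assumes, purely for illustration, that both sequences are placed at evenly spaced cell centers on one common timeline; the helper `shared_positions`, the `span` value, and all shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles of shape (len(positions), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (n, dim) by position-dependent angles."""
    n, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    ang = rope_angles(positions, dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def shared_positions(num_tokens, span=64.0):
    """Assumed mapping: evenly spaced cell centers on a common timeline [0, span)."""
    return (np.arange(num_tokens) + 0.5) * span / num_tokens

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))  # 16 latent frames (hypothetical)
plans = rng.normal(size=(4, 8))     # 4 keyframe-level planned tokens (hypothetical)

latent_pos = shared_positions(16)
plan_pos = shared_positions(4)

# Attention logits between latents (queries) and planned tokens (keys).
scores = apply_rope(latents, latent_pos) @ apply_rope(plans, plan_pos).T
```

Because rotary encodings make the dot product depend only on the offset between two positions, putting both sequences on one timeline lets a latent frame and a planned token interact according to how far apart they are in time, even though the two sequences have different lengths.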