🤖 AI Summary
Existing open-source diffusion models struggle to achieve fine-grained and stable cross-modal synchronization when generating audio-visual content for complex semantic scenes, primarily because they rely on coarse text embeddings. To address this limitation, this work proposes Baton, a framework that is, according to the authors, the first to introduce explicit semantic planning into joint audio-visual generation. Baton employs a VA-Planner module that jointly reasons about and mutually aligns audio and video planned tokens before denoising; these tokens serve as keyframe-level blueprints that coordinate the generation trajectory. It further incorporates a Relative Semantic RoPE mechanism that maps planned tokens and diffusion latents into a shared spatial-temporal coordinate frame, so that each latent can attend to its positionally corresponding semantic cues. Experiments on benchmarks show Baton's effectiveness both qualitatively and quantitatively.
📝 Abstract
Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.
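The idea behind placing planned tokens and latents in one coordinate frame can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single temporal axis, standard rotary embeddings, and evenly spaced planned tokens, and the names `shared_positions`, `apply_rope`, and `plan_attention_scores` are hypothetical. It shows only the general property that a relative rotary encoding makes the attention score between a latent and a planned token depend on their offset in the shared frame.

```python
import numpy as np

def shared_positions(n_tokens, n_latents):
    """Assumption: spread n_tokens evenly over the latent time axis [0, n_latents - 1]."""
    if n_tokens == 1:
        return np.array([(n_latents - 1) / 2.0])
    return np.linspace(0.0, n_latents - 1, n_tokens)

def apply_rope(x, positions, base=10000.0):
    """Standard rotary embedding: rotate channel pairs of x by position-scaled angles.

    x has shape (n, dim) with even dim; positions has shape (n,) and may be fractional.
    """
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    ang = np.outer(positions, inv_freq)                # (n, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def plan_attention_scores(latent_q, plan_k):
    """Scaled dot-product scores of latents (queries) against planned tokens (keys)."""
    n_latents, dim = latent_q.shape
    q = apply_rope(latent_q, np.arange(n_latents, dtype=float))
    k = apply_rope(plan_k, shared_positions(plan_k.shape[0], n_latents))
    return q @ k.T / np.sqrt(dim)                      # (n_latents, n_tokens)

# Illustrative shapes only: 17 latent frames, 5 planned tokens, 8 channels.
rng = np.random.default_rng(0)
scores = plan_attention_scores(rng.normal(size=(17, 8)), rng.normal(size=(5, 8)))
```

With this placement, a planned token that describes the middle of the clip sits at the same coordinate as the middle latent frame, even though the two sequences have different lengths. A spatial-temporal version would presumably split the channels across time, height, and width axes, which this sketch omits.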