Latent Action Control for Reasoning-Guided Unified Image Generation

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unified multimodal models support both visual understanding and image generation in one backbone, but what they understand does not reliably control what they generate. This work proposes Latent Action Control (LAC), which represents reasoning as a learnable continuous latent action trajectory inside a unified generator. LAC uses a role-structured latent trajectory covering planning, internal visual drafting, diagnosis, and refinement. Training combines prior-guided variational latent action alignment, supervised by rendered semantic priors, draft image features, and halting signals, with Latent-Flow GRPO, which aligns the latent-to-image rollout with terminal visual feedback. Instantiated on BAGEL-7B-MoT and evaluated on GenEval, WISE, and T2I-CompBench, LAC improves compositional and knowledge-grounded generation, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts.

📝 Abstract

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.
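
The abstract does not spell out how Latent-Flow GRPO scores rollouts. As a rough illustration only, the sketch below shows the group-relative advantage normalization that standard GRPO uses: several rollouts are sampled for the same prompt, each receives a terminal reward, and each reward is standardized against its own group. The function name, the example reward values, and the choice of population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize terminal rewards within one prompt's group of rollouts.

    Hypothetical helper: generic GRPO-style normalization, not the
    paper's exact Latent-Flow GRPO objective.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    # eps keeps the division defined when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up terminal rewards for four latent-to-image rollouts of one prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above the group mean get positive advantages and are reinforced; those below get negative ones. Because the baseline is the group itself, no separate value network is needed, which is the usual motivation for GRPO-style training.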

Problem

Research questions and friction points this paper is trying to address.

latent action

reasoning-guided generation

unified multimodal models

image generation control

compositional generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Action Control

Unified Image Generation

Reasoning-Guided Control