🤖 AI Summary
This work addresses a limitation of current visual generation models: they rely on text prompts, which often fail to convey spatial structure and fine-grained visual detail precisely. The authors propose a visual-to-visual (V2V) generation paradigm that replaces the text prompt with a visual specification page, exploiting the fact that a frozen vision-language model (VLM) already maps both images and text into the generator's conditioning space. They introduce V2V-Zero, a training-free framework that enables purely visual conditioning on existing VLM-conditioned generators, along with Simple-V2V Bench, a new benchmark spanning seven visual-conditioning tasks. Instead of text-only conditioning, the method feeds the generator final-layer VLM hidden states extracted from the visual page. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, close to the same model's text-to-image performance. On Simple-V2V Bench it scores 32.7/100, outperforming the evaluated open-weight image baselines, and a HunyuanVideo-1.5 video extension scores 20.2/100.
📝 Abstract
Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space.
On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.
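The 95.0% figure is a share of attention mass, and the abstract does not spell out how it is aggregated. One plausible reading, sketched below under that assumption, is the fraction of all attention weight over conditioning tokens that falls on tokens originating from the visual page. The function name, the row-per-query layout, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def visual_attention_share(
    attention: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> float:
    """Fraction of total attention mass that lands on visual-page tokens.

    attention[q][k] is the weight query q places on conditioning token k
    (an assumed layout; the paper's exact aggregation is not given in the abstract).
    is_visual[k] marks conditioning tokens that come from the visual page.
    """
    if not attention:
        raise ValueError("attention must contain at least one query row")
    visual_mass = 0.0
    total_mass = 0.0
    for row in attention:
        if len(row) != len(is_visual):
            raise ValueError("each row needs one weight per conditioning token")
        for weight, visual in zip(row, is_visual):
            total_mass += weight
            if visual:
                visual_mass += weight
    if total_mass == 0.0:
        raise ValueError("attention mass is zero")
    return visual_mass / total_mass


# Toy data: 2 queries over 4 conditioning tokens (first 3 visual, last 1 text).
toy_attention = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.5, 0.125, 0.125],
]
toy_mask = [True, True, True, False]
share = visual_attention_share(toy_attention, toy_mask)
```

In a real analysis the weights would come from the generator's attention over the conditioning sequence, and a choice would have to be made about how to average across heads, layers, and denoising steps; the sketch leaves all of that out.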