CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

📅 2025-12-02
🤖 AI Summary
Multi-view diffusion models suffer from degraded geometric correspondence and insufficient view consistency under large viewpoint variations in novel view synthesis. We reveal that their attention maps implicitly learn cross-view geometric correspondences during training, yet this signal remains sparse and unstable. To address this, we propose CAMEO: a lightweight, model-agnostic method that imposes explicit geometric supervision on only a single layer of self-attention maps, requiring no auxiliary networks or human annotations. By enhancing structural alignment across views, CAMEO reduces required training iterations by 50% on multiple benchmarks while improving PSNR by 1.2–2.3 dB at equal iteration counts. The generated views exhibit superior geometric consistency and visual quality, achieving state-of-the-art performance.

📝 Abstract
Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.
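
The observation that attention maps attend to geometrically corresponding regions can be probed with a simple argmax check. The snippet below is a minimal sketch, not the paper's evaluation protocol: the function name, the layout (one attention row per target-view token, one column per reference-view token), the validity mask, and the toy numbers are all assumptions made for illustration.

```python
def attention_correspondence_accuracy(attn, gt_index, valid):
    """Fraction of valid target tokens whose most-attended reference token
    equals the ground-truth corresponding token.

    attn:     list of rows; attn[q][k] is the attention weight from target
              token q to reference token k (assumed layout).
    gt_index: gt_index[q] is the reference token that geometrically
              corresponds to target token q (assumed to come from known
              depth and camera poses).
    valid:    valid[q] is False when token q has no visible correspondence,
              for example because of occlusion.
    """
    hits = 0
    total = 0
    for row, gt, ok in zip(attn, gt_index, valid):
        if not ok:
            continue  # skip tokens without a usable correspondence
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over keys
        hits += int(predicted == gt)
        total += 1
    return hits / total if total else 0.0


# Toy example: 4 target tokens attending over 3 reference tokens.
attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
gt_index = [0, 2, 1, 0]
valid = [True, True, True, False]
accuracy = attention_correspondence_accuracy(attn, gt_index, valid)
```

Tracking a score like this over training steps, or bucketing it by viewpoint change, is one hypothetical way to observe the behavior the abstract describes: correspondence that emerges during training but degrades under large viewpoint changes.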

Problem

Research questions and friction points this paper is trying to address.

Enhances multi-view diffusion models' geometric correspondence accuracy
Reduces training iterations by half while improving synthesis quality
Provides a model-agnostic training technique for view-consistent generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervises attention maps using geometric correspondence
Shows that supervising a single attention layer is sufficient to guide the model toward precise correspondences
Reduces training iterations by half while improving performance
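
To make the idea of supervising attention maps with geometric correspondence concrete, here is a minimal sketch of one possible alignment loss. The summary above does not state the loss CAMEO actually uses, so the cross-entropy form, the function names, and the tensor layout are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment_loss(logits, gt_index, valid):
    """Assumed form: mean cross-entropy between each target token's attention
    distribution over reference tokens and its ground-truth correspondence.

    logits:   logits[q][k] is the pre-softmax attention score from target
              token q to reference token k, taken from ONE chosen layer.
    gt_index: gt_index[q] is the geometrically corresponding reference token.
    valid:    valid[q] is False where no correspondence exists (occlusion,
              out-of-frame); those tokens contribute nothing to the loss.
    """
    losses = []
    for row, gt, ok in zip(logits, gt_index, valid):
        if not ok:
            continue
        probs = softmax(row)
        losses.append(-math.log(probs[gt]))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: attention that already points at the right tokens is
# penalized less than attention that points elsewhere.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
aligned_loss = attention_alignment_loss(logits, [0, 2], [True, True])
misaligned_loss = attention_alignment_loss(logits, [1, 0], [True, True])
```

In a training loop, a term like this would presumably be added to the usual diffusion objective with a weighting coefficient, and computed for a single attention layer only, in line with the single-layer claim above.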