Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

πŸ“… 2026-01-16
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language models lack fine-grained spatial and physical reasoning capabilities, making single-pass inverse reconstruction from images to editable graphics programs challenging. This work proposes VIGA, a task-agnostic, model-agnostic, general-purpose inverse graphics generation framework. VIGA operates through a closed-loop pipeline of writing, executing, rendering, comparing, and revising code. It leverages a skill library that enables dynamic role switching between generator and validator, an evolutionary context memory that maintains planning states, code deltas, and rendering history, and tight integration with a graphics engine. Starting from an empty world, VIGA iteratively achieves 3D reconstruction, multi-step editing, and 4D physical interaction. It demonstrates significant performance gains of 35.32%, 117.17%, and 124.70% on BlenderGym, SlideBench, and the newly introduced BlenderBench, respectively.
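The closed-loop pipeline described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: the function names (`generate`, `execute`, `compare`), the `ContextMemory` fields, and the toy stand-ins for the VLM and graphics engine are all assumptions for exposition.

```python
# Hypothetical sketch of a VIGA-style write-run-render-compare-revise loop.
# The real system uses a VLM as generator/validator and Blender as the engine;
# here those are replaced by injectable callables.
from dataclasses import dataclass, field

@dataclass
class ContextMemory:
    """Evolving context memory: plans, code deltas, and render history."""
    plans: list = field(default_factory=list)
    code_deltas: list = field(default_factory=list)
    renders: list = field(default_factory=list)

def closed_loop(generate, execute, compare, target, max_iters=5, threshold=0.9):
    """Iterate from an empty world until the render matches the target.

    generate(memory)        -> code delta   (generator role)
    execute(code)           -> render       (graphics-engine stand-in)
    compare(render, target) -> score in [0, 1]  (validator role)
    """
    memory = ContextMemory()
    code = ""                                # start from an empty world
    for _ in range(max_iters):
        delta = generate(memory)             # generator role: propose a revision
        code += delta
        render = execute(code)               # run the program and render the scene
        score = compare(render, target)      # validator role: compare with target
        memory.code_deltas.append(delta)     # evolve the context memory
        memory.renders.append((render, score))
        if score >= threshold:               # close the loop once the match is good
            break
    return code, memory

# Toy demonstration: "reconstruct" a target string one character per iteration.
def toy_generate(memory):
    return "abc"[len(memory.code_deltas)]

def toy_execute(code):
    return code                              # identity "renderer"

def toy_compare(render, target):
    return sum(a == b for a, b in zip(render, target)) / len(target)

final_code, mem = closed_loop(toy_generate, toy_execute, toy_compare, "abc")
```

In the toy run, the loop converges in three revisions, with the score rising from 1/3 to 1.0 as each delta is validated against the target.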

πŸ“ Abstract
Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs cannot achieve this in one shot, as they lack fine-grained spatial and physical grounding. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Building on this insight, we present VIGA (Vision-as-Inverse-Graphics Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates between generator and verifier roles and (ii) an evolving context memory that retains plans, code diffs, and render history. VIGA is task-agnostic: it requires no auxiliary modules and covers a wide range of tasks, including 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we find that VIGA substantially improves over one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). VIGA is also model-agnostic: it requires no finetuning, enabling a unified protocol for evaluating heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, on which VIGA improves by 124.70%.
Problem

Research questions and friction points this paper is trying to address.

Vision-as-Inverse-Graphics
spatial grounding
physical grounding
multimodal reasoning
graphics program reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved multimodal reasoning
vision-as-inverse-graphics
closed-loop reconstruction
skill library
context memory
πŸ”Ž Similar Papers
No similar papers found.