I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

πŸ“… 2026-01-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing text-guided image editing methods, which struggle with precise local control and multi-object spatial reasoning in complex compositional tasks. The authors attribute these failures to the tight coupling of planning and execution, the absence of object-level representations, and pixel-centric modeling. To overcome these challenges, the proposed I2E framework introduces a "Decompose-then-Action" paradigm: a Decomposer first generates manipulable object layers, and a physics-aware Vision-Language-Action agent then performs atomic editing operations within a structured interactive environment, guided by chain-of-thought reasoning. By integrating image decomposition, physics-aware vision-language modeling, and interpretable reasoning, I2E establishes an end-to-end structured editing system that significantly outperforms prior methods on the newly curated I2E-Bench and on multiple public benchmarks, with notable gains in complex instruction understanding, physical plausibility, and multi-turn editing stability.

πŸ“ Abstract
Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers, and then introduces a physics-aware Vision-Language-Action Agent that parses complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
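The abstract's "Decompose-then-Action" control flow — decompose an image into object layers, plan atomic actions from an instruction, then execute them in a structured environment — can be sketched as follows. This is a minimal illustrative sketch only: every class, function, and operation name here is a hypothetical assumption, since the paper does not specify its models or APIs, and the Decomposer and agent are replaced by trivial stand-ins.

```python
# Hypothetical sketch of a "Decompose-then-Action" pipeline.
# All names (ObjectLayer, AtomicAction, decompose, plan_actions, execute)
# are illustrative assumptions, not the paper's actual interfaces.
from dataclasses import dataclass, field

@dataclass
class ObjectLayer:
    """One manipulable object layer produced by the Decomposer."""
    name: str
    position: tuple  # (x, y) placeholder for layer geometry

@dataclass
class AtomicAction:
    """One atomic editing operation emitted by the agent."""
    op: str          # e.g. "move"
    target: str      # name of the object layer to act on
    args: dict = field(default_factory=dict)

def decompose(scene_objects: list[str]) -> dict[str, ObjectLayer]:
    """Stand-in for the Decomposer: turn an unstructured scene into layers."""
    return {name: ObjectLayer(name, (0, 0)) for name in scene_objects}

def plan_actions(instruction: str, layers: dict[str, ObjectLayer]) -> list[AtomicAction]:
    """Stand-in for the agent's Chain-of-Thought planning: a toy rule
    that moves any layer whose name appears in the instruction."""
    return [AtomicAction("move", name, {"dx": 10, "dy": 0})
            for name in layers if name in instruction]

def execute(layers: dict[str, ObjectLayer], actions: list[AtomicAction]) -> dict[str, ObjectLayer]:
    """Apply atomic actions to layers inside the structured environment."""
    for act in actions:
        if act.op == "move":
            x, y = layers[act.target].position
            layers[act.target].position = (x + act.args["dx"], y + act.args["dy"])
    return layers

layers = decompose(["cat", "sofa"])
actions = plan_actions("move the cat to the right", layers)
layers = execute(layers, actions)
print(layers["cat"].position)   # the "cat" layer moved; "sofa" untouched
```

The point of the sketch is the separation the paper argues for: planning (`plan_actions`) never touches pixels, and execution (`execute`) only applies discrete, object-level operations.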
Problem

Research questions and friction points this paper is trying to address.

text-guided image editing
compositional editing
object-level control
spatial reasoning
pixel-level inpainting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose-then-Action
Object-level Editing
Vision-Language-Action Agent
Chain-of-Thought Reasoning
Structured Interactive Environment