VENUS: Visual Editing with Noise Inversion Using Scene Graphs

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a central challenge in text-driven image editing: simultaneously preserving background fidelity and ensuring semantic consistency. Existing scene graph–based approaches typically require model fine-tuning, resulting in high computational costs and limited scalability. To overcome these limitations, we propose the first training-free, scene graph–guided image editing framework. Our method leverages a multimodal large language model to parse scene graphs and integrates diffusion models with noise inversion and a disentangled prompting mechanism to enable precise semantic modifications while maintaining the integrity of unedited regions. The approach significantly improves both editing quality and efficiency: on PIE-Bench, it raises PSNR (22.45→24.80) and SSIM (0.79→0.84), lowers LPIPS (0.100→0.070), and achieves a CLIP similarity score of 24.97; on EditVal, it attains a DINO score of 0.87 while reducing per-image editing time from 6–10 minutes to just 20–30 seconds.
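The summary describes parsing an image into a scene graph (objects plus relation triples) and then applying the requested edit to that structured representation before any pixels change. A minimal sketch of that idea follows; the `SceneGraph` class, its `replace_object` helper, and the cat/sofa example are illustrative assumptions, not the paper's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Nodes are object labels; edges are (subject, relation, object) triples.
    objects: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def replace_object(self, old: str, new: str) -> "SceneGraph":
        """Return a new graph with `old` swapped for `new` everywhere,
        leaving all relations (and hence the scene layout) intact."""
        objs = {new if o == old else o for o in self.objects}
        rels = [(new if s == old else s, r, new if t == old else t)
                for s, r, t in self.relations]
        return SceneGraph(objs, rels)

# Graph for "a cat sitting on a sofa"; edit request: cat -> dog.
g = SceneGraph({"cat", "sofa"}, [("cat", "sitting on", "sofa")])
edited = g.replace_object("cat", "dog")
```

Because only the targeted node changes while every relation is preserved, the edited graph still pins down what must stay fixed (the sofa, the "sitting on" arrangement), which is what gives scene-graph editing its controllability over free-form text prompts.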

📝 Abstract
State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To address this, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while simultaneously leveraging noise inversion to preserve fidelity in unedited regions. Moreover, our proposed approach integrates scene graphs extracted from multimodal large language models with diffusion backbones, without requiring any additional training. Empirically, VENUS substantially improves both background preservation and semantic alignment on PIE-Bench, increasing PSNR from 22.45 to 24.80, SSIM from 0.79 to 0.84, and reducing LPIPS from 0.100 to 0.070 relative to the state-of-the-art scene graph editing model (SGEdit). In addition, VENUS enhances semantic consistency as measured by CLIP similarity (24.97 vs. 24.19). On EditVal, VENUS achieves the highest fidelity with a 0.87 DINO score and, crucially, reduces per-image runtime from 6–10 minutes to only 20–30 seconds.
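The "noise inversion" the abstract relies on is the standard deterministic DDIM-style trick: run the diffusion update in reverse to map the real image to a noise latent, so that re-sampling from that latent reconstructs the image exactly where no edit is applied. A self-contained NumPy sketch of the round trip, using a toy noise predictor in place of a trained denoiser (the paper's actual model and schedules are not shown here):

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_bar):
    """Deterministic DDIM inversion: map a clean image x0 to a noise
    latent by running the DDIM update in reverse (t -> t+1)."""
    x = x0
    for t in range(len(alphas_bar) - 1):
        eps = eps_model(x, t)
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1 - a_next) * eps
    return x

def ddim_sample(xT, eps_model, alphas_bar):
    """Deterministic DDIM sampling (t+1 -> t); with matching noise
    predictions, this exactly undoes ddim_invert step by step."""
    x = xT
    for t in reversed(range(len(alphas_bar) - 1)):
        eps = eps_model(x, t + 1)
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        x0_pred = (x - np.sqrt(1 - a_next) * eps) / np.sqrt(a_next)
        x = np.sqrt(a_t) * x0_pred + np.sqrt(1 - a_t) * eps
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))          # stand-in for an image latent
fixed_eps = rng.standard_normal((4, 4))   # toy, input-independent "denoiser"
eps_model = lambda x, t: fixed_eps
alphas_bar = np.linspace(0.9999, 0.1, 10)  # cumulative noise schedule

xT = ddim_invert(x0, eps_model, alphas_bar)
x0_rec = ddim_sample(xT, eps_model, alphas_bar)
```

With a constant noise predictor the reconstruction `x0_rec` matches `x0` exactly; with a real denoiser the round trip is only approximate, which is why methods like VENUS confine edits to the target region and lean on the inverted latent to keep the background faithful.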
Problem

Research questions and friction points this paper is trying to address.

image editing
scene graph
background preservation
semantic consistency
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

noise inversion
scene graph
training-free editing
diffusion models
prompt disentanglement