RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing single-reference image editing methods struggle to model and transfer non-rigid, content-aware visual relationships. To address this, we propose the first few-shot visual relation editing framework designed for editing-intent generalization. First, we introduce RelationAdapter, a lightweight module that integrates explicit relational modeling into the Diffusion Transformer (DiT) architecture, enabling context-aware extraction and transfer of editing intents. Second, we construct Relation252K, a large-scale benchmark encompassing 218 distinct relational editing tasks, filling a critical gap in evaluation resources for this domain. Extensive experiments demonstrate that our method significantly outperforms single-reference baselines in editing accuracy, generation quality, and cross-image intent transfer. This work shifts the focus of visual editing from superficial appearance adjustment toward deep semantic relationship understanding.

📝 Abstract
Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.
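This page does not spell out the architecture beyond the abstract. As a rough illustration of the stated idea (pooling a source-target example pair into "editing intent" tokens and injecting them into a DiT block), a minimal PyTorch sketch might look like the following; the class name aside, every dimension, layer choice, and injection point here is an assumption, not the paper's actual design:

```python
# Hypothetical sketch of a RelationAdapter-style module (names and wiring
# are assumptions; the paper's actual architecture may differ).
import torch
import torch.nn as nn


class RelationAdapter(nn.Module):
    """Pools a (source, target) example pair into compact 'editing intent'
    tokens and injects them into DiT hidden states via cross-attention."""

    def __init__(self, dim: int = 1024, num_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries that summarize the example pair into a small
        # set of relation tokens.
        self.relation_queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.pair_encoder = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention that lets DiT hidden states read the relation tokens.
        self.inject = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def encode_pair(self, src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
        # src_feats, tgt_feats: (B, N, dim) patch features of the example pair.
        pair = torch.cat([src_feats, tgt_feats], dim=1)  # (B, 2N, dim)
        q = self.relation_queries.unsqueeze(0).expand(pair.size(0), -1, -1)
        relation_tokens, _ = self.pair_encoder(q, pair, pair)  # (B, num_tokens, dim)
        return relation_tokens

    def forward(self, hidden: torch.Tensor, relation_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, L, dim) DiT hidden states for the query image.
        attn_out, _ = self.inject(self.norm(hidden), relation_tokens, relation_tokens)
        return hidden + attn_out  # residual injection keeps the base model intact
```

The residual form matters for a "lightweight module" claim: the base DiT weights stay frozen and untouched, and only the adapter parameters need training.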
Problem

Research questions and friction points this paper is trying to address.

Enabling non-rigid image transformations via visual prompts
Transferring editing intent from source-target pairs to new images
Improving generalization in visual prompt-driven editing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages source-target pairs for content-aware editing
Introduces RelationAdapter for DiT-based transformation learning (see the sketch after this list)
Uses Relation252K dataset for evaluating model generalization
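Taken together, the first two bullets describe a pair-then-query transfer at inference time: extract the intent from one example pair, then condition generation on the new image. A toy invocation of the hypothetical sketch above, with random tensors standing in for real DiT patch features and illustrative shapes only:

```python
# Continues the sketch above; tensors are stand-ins, not real image features.
import torch

adapter = RelationAdapter(dim=1024)

src_feats = torch.randn(1, 256, 1024)     # example source image features
tgt_feats = torch.randn(1, 256, 1024)     # example target (edited) image features
query_hidden = torch.randn(1, 256, 1024)  # DiT hidden states for the query image

intent = adapter.encode_pair(src_feats, tgt_feats)  # "what edit was applied?"
conditioned = adapter(query_hidden, intent)         # "apply it to this image"
print(conditioned.shape)  # torch.Size([1, 256, 1024])
```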
👥 Authors
Yan Gong
Zhejiang University
Yiren Song
Ph.D. student, National University of Singapore
Generative AI, Diffusion, Unified model
Yicheng Li
Zhejiang University
Computer Science
Chenglin Li
Zhejiang University
Yin Zhang
Zhejiang University