ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

📅 2025-01-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
ACE++ is an instruction-driven framework for image generation and editing built on FLUX.1-dev. Method: it improves the Long-context Condition Unit (LCU) introduced in ACE and extends the inpainting-style input format of FLUX.1-Fill-dev to arbitrary editing and generation tasks under multimodal inputs, including text instructions, masks, and reference images. A two-stage training scheme reuses text-to-image generative priors: the first stage pre-trains on 0-ref task data starting from the text-to-image model (post-trained community models such as FLUX.1-Fill-dev can serve as an initialization to accelerate training), and the second stage fine-tunes on all tasks defined in ACE to support general instructions. Both fully fine-tuned and lightweight fine-tuned model variants are provided for general-purpose and vertical scenarios. Results: qualitative analysis shows improvements over prior methods in generated image quality and prompt-following ability.
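The LCU-style input format can be pictured as packing the conditioning signals alongside the diffusion state. A minimal sketch, assuming the channel-wise concatenation convention used by inpainting models such as FLUX.1-Fill-dev; the function name, shapes, and masking rule are illustrative assumptions, not the released implementation:

```python
import torch

def pack_lcu_input(noisy_latents, ref_latents, mask):
    """Pack conditioning signals with the diffusion state.

    Hypothetical sketch: concatenates noisy latents, masked
    reference latents, and the mask along the channel axis,
    mirroring the inpainting input format that ACE++ generalizes.
    """
    # noisy_latents: (B, C, H, W) latents being denoised
    # ref_latents:   (B, C, H, W) VAE-encoded reference image
    # mask:          (B, 1, H, W) 1 where content should be filled
    masked_ref = ref_latents * (1.0 - mask)  # hide the fill region
    return torch.cat([noisy_latents, masked_ref, mask], dim=1)

# Example: batch of 2, 16 latent channels, a 64x64 latent grid
x = pack_lcu_input(torch.randn(2, 16, 64, 64),
                   torch.randn(2, 16, 64, 64),
                   torch.zeros(2, 1, 64, 64))
print(x.shape)  # torch.Size([2, 33, 64, 64])
```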

📝 Abstract
We report ACE++, an instruction-based diffusion framework that tackles a wide range of image generation and editing tasks. Inspired by the input format proposed by FLUX.1-Fill-dev for the inpainting task, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to arbitrary editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme that minimizes the effort of fine-tuning powerful text-to-image diffusion models such as FLUX.1-dev. In the first stage, we pre-train the model on 0-ref task data, starting from the text-to-image model. Many community models post-trained from text-to-image foundation models already fit this first-stage paradigm; for example, FLUX.1-Fill-dev deals primarily with inpainting tasks and can serve as an initialization to accelerate training. In the second stage, we fine-tune this model on all tasks defined in ACE so that it supports general instructions. To promote the widespread application of ACE++ across scenarios, we provide a comprehensive set of models covering both full fine-tuning and lightweight fine-tuning, addressing general applicability as well as vertical scenarios. Qualitative analysis showcases the superiority of ACE++ in generated image quality and prompt-following ability.
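To make the two-stage scheme concrete, here is a hypothetical encoding of the training schedule the abstract describes; the stage names, task lists, and the lightweight flag are illustrative assumptions rather than the paper's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    init_from: str    # checkpoint used to initialize this stage
    tasks: list       # task families trained in this stage
    lightweight: bool # LoRA-style vs. full fine-tuning

# Stage 1: pre-train on 0-ref tasks from a text-to-image base; a
# post-trained model like FLUX.1-Fill-dev can stand in as the
# initialization to accelerate this stage.
# Stage 2: fine-tune on all ACE-defined tasks for general instructions.
SCHEDULE = [
    Stage("pretrain", "FLUX.1-dev (or FLUX.1-Fill-dev)",
          ["0-ref generation", "inpainting", "outpainting"], False),
    Stage("instruction_finetune", "stage-1 checkpoint",
          ["all editing/generation tasks defined in ACE"], True),
]

for stage in SCHEDULE:
    print(f"[{stage.name}] init={stage.init_from}, "
          f"lightweight={stage.lightweight}, tasks={stage.tasks}")
```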
Problem

Research questions and friction points this paper is trying to address.

Instruction-Based Image Creation and Editing
Context-Aware Content Filling
Efficient Reuse of Text-to-Image Generative Priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

ACE++
Long-context Condition Unit (LCU)
Two-Stage Training Scheme
Context-Aware Content Filling
🔎 Similar Papers
No similar papers found.