🤖 AI Summary
This work addresses mask-shape bias in coarse-mask local image editing, where a rough mask acts as an unintended shape prior that pulls the generated object toward the mask's accidental boundary. The authors propose BRIDGE, a DiT-based framework that keeps masks outside the backbone, using them only for support construction and blending rather than injecting them into the DiT or adding copied control branches. Its dual-path BridgePath generation separates a Main Path, which preserves background context, from a Subject Path, which generates the editable content from independent noise. A learnable Discrete Geometric Gate routes positional embeddings at the token level, so subject tokens can borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. With a 13.31M-parameter routing module, BRIDGE reaches a Local SigLIP2-T score of 0.503 on BRIDGE-Bench, compared with 0.262 for FLUX.1-Fill and 0.390 for ACE++, and shows competitive zero-shot alignment and source-image preservation on MagicBrush and ICE-Bench.
📝 Abstract
Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.
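The token-level routing idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `route_positions`, the two-column logit layout, and the use of plain (row, col) coordinates in place of real positional embeddings are all assumptions made for illustration, and the abstract does not describe how the gate is trained.

```python
import numpy as np

def route_positions(gate_logits, background_pos, subject_pos):
    """Pick one coordinate set per subject token with a hard (discrete) gate.

    Hypothetical layout, assumed for this sketch:
      gate_logits:    (N, 2) scores; column 0 = background-anchored,
                      column 1 = subject-centric
      background_pos: (N, 2) (row, col) coordinates in the background frame
      subject_pos:    (N, 2) (row, col) coordinates in the subject's own frame
    """
    # Discrete decision: each token commits to exactly one coordinate frame.
    choice = np.argmax(gate_logits, axis=1)        # shape (N,), values in {0, 1}
    use_background = (choice == 0)[:, None]        # shape (N, 1), broadcasts over (row, col)
    routed = np.where(use_background, background_pos, subject_pos)
    return routed, choice

# Three subject tokens: the first and third favour (or tie toward) the
# background frame, the second favours its own subject-centric frame.
gate_logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.0, 0.0]])
background_pos = np.array([[10, 10], [10, 11], [10, 12]])
subject_pos = np.array([[0, 0], [0, 1], [0, 2]])

routed, choice = route_positions(gate_logits, background_pos, subject_pos)
print(choice.tolist())   # [0, 1, 0]
print(routed.tolist())   # [[10, 10], [0, 1], [10, 12]]
```

In this toy setup a token routed to the background frame shares coordinates with nearby background tokens, which corresponds to the "borrowing" behaviour described near fusion regions, while a token that keeps its subject-centric coordinates is not tied to the background layout.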