🤖 AI Summary
This work addresses the challenge of achieving efficient panel-aware contextual image generation while preserving the intrinsic geometric structure and intra-panel generative behavior of pretrained diffusion models. The authors propose a parameter-efficient adaptation method that introduces learnable, panel-specific orthogonal operators on top of frozen positional encodings to enable relative panel conditioning in diffusion transformers. Because these operators are isometric and constant within a panel, they preserve the pretrained model's intra-panel behavior and remain compatible with diverse positional encoding schemes without modifying the backbone architecture. Experiments show that the approach substantially improves image-instructed contextual generation and boosts the performance of existing state-of-the-art methods.
📝 Abstract
We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators with the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context, image-instructed editing pipelines, including state-of-the-art approaches.
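The mechanism described above can be illustrated with a minimal numerical sketch. This is an illustrative assumption, not the paper's implementation: the block-rotation parameterization, array shapes, and names (`panel_operator`, `panel_ids`) are invented here. Each token's frozen positional encoding is multiplied by an orthogonal operator chosen by its panel id; orthogonality gives isometry (norms and inner products are preserved), and sharing a single operator across all tokens of a panel gives same-panel invariance (relative geometry within a panel is unchanged).

```python
import numpy as np

def panel_operator(angles: np.ndarray) -> np.ndarray:
    """Block-diagonal 2x2 rotation matrix built from per-pair angles.
    Orthogonal by construction, hence isometric on the features it acts on."""
    d = 2 * len(angles)
    Q = np.zeros((d, d))
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        Q[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return Q

rng = np.random.default_rng(0)
d, n_tokens = 8, 6
pos_enc = rng.standard_normal((n_tokens, d))   # stand-in for frozen positional encodings
panel_ids = np.array([0, 0, 0, 1, 1, 1])        # two panels, three tokens each

# One operator per panel; the angles would be the learnable parameters.
# Panel 0 uses the identity, so the pretrained behavior is untouched there.
panel_angles = {0: np.zeros(d // 2),
                1: rng.uniform(-np.pi, np.pi, d // 2)}
ops = {p: panel_operator(a) for p, a in panel_angles.items()}

adapted = np.stack([ops[p] @ e for e, p in zip(pos_enc, panel_ids)])

# Isometry: per-token norms are preserved under the orthogonal operators.
assert np.allclose(np.linalg.norm(adapted, axis=1),
                   np.linalg.norm(pos_enc, axis=1))
# Same-panel invariance: within panel 1, pairwise inner products (the Gram
# matrix) are unchanged, because Q^T Q = I for an orthogonal Q.
assert np.allclose(adapted[3:] @ adapted[3:].T, pos_enc[3:] @ pos_enc[3:].T)
```

Because the operators act only on the positional encodings and never modify backbone weights, the same composition applies to any encoding scheme that enters attention as a per-token vector, which is one way to read the paper's claim of generality across positional encoding regimes.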