Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing text-guided diffusion models struggle to achieve fine-grained control over the spatial layout, tissue morphology, and semantic details of pathological images, largely due to the scarcity of large-scale data pairing layouts with diagnostic descriptions. To address this, the work proposes the In-Context Diffusion Transformer (IC-DiT), which integrates spatial layouts, textual descriptions, and visual embeddings through a hierarchical multimodal attention mechanism to ensure both global semantic consistency and local structural precision. The authors design a multi-agent large vision-language model (LVLM) framework to automatically generate clinically aligned, fine-grained annotations at scale, and introduce layout guidance into the diffusion transformer for the first time, enabling controllable pathological image generation. Evaluated on five histopathology datasets, the method significantly outperforms existing approaches in image fidelity, spatial controllability, and diagnostic consistency, while also improving downstream tasks such as cancer classification and survival analysis.
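The core mechanism described above, image tokens attending jointly over an in-context sequence of layout, text, and visual tokens, can be illustrated with a toy single-head attention sketch. This is not the paper's implementation; all shapes, the identity Q/K/V projections, and the `joint_attention` helper are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(img_tokens, layout_tokens, text_tokens):
    """Toy stand-in for hierarchical multimodal attention: noisy image
    tokens attend over the concatenated [layout | text | image] context,
    so layout and text condition every image patch in one pass."""
    ctx = np.concatenate([layout_tokens, text_tokens, img_tokens], axis=0)
    d = img_tokens.shape[-1]
    q, k, v = img_tokens, ctx, ctx  # identity projections for brevity
    attn = softmax(q @ k.T / np.sqrt(d))  # (n_img, n_ctx) attention weights
    return attn @ v                       # conditioned image tokens

# hypothetical shapes: 4 image patches, 2 layout boxes, 3 text tokens, dim 8
rng = np.random.default_rng(0)
out = joint_attention(rng.standard_normal((4, 8)),
                      rng.standard_normal((2, 8)),
                      rng.standard_normal((3, 8)))
print(out.shape)  # (4, 8)
```

In the actual model this attention would be multi-headed, use learned projections, and run inside each transformer block of the diffusion backbone; the sketch only shows how layout tokens enter the same attention context as text and image tokens.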

πŸ“ Abstract
Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose the In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.
Problem

Research questions and friction points this paper is trying to address.

controllable pathology image generation
spatial layout
fine-grained structural constraints
diffusion models
histopathology datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Diffusion Transformer
layout-guided generation
multimodal attention
pathology image synthesis
LVLM annotation