LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses controllable dichotomous image segmentation (DIS). The authors propose a language-window co-driven latent diffusion framework with a macro-micro dual-mode control mechanism: at the macro level, natural-language prompts guide coarse segmentation initialization; at the micro level, an adjustable spatial window refines the mask through localized optimization. The two modes can be deployed independently or jointly. Crucially, linguistic semantics and geometric window constraints are unified within the latent diffusion process, so semantic controllability and spatial precision reinforce each other. Evaluated on the DIS5K benchmark, the method outperforms 11 state-of-the-art approaches across all subsets; notably, on the DIS-TE test set it improves the $F_β^ω$ metric by 4.6% over the second-best method, MVANet. The framework thereby advances both segmentation accuracy and interactive flexibility for personalised applications.

📝 Abstract
We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_β^ω$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Codes will be made available at https://github.com/XinyuYanTJU/LawDIS.
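The macro-to-micro control flow described above can be sketched in a few lines. The sketch below is purely illustrative: the function names, window format, and mask logic are assumptions for exposition, and the latent diffusion model is mocked out; it is not the authors' API.

```python
# Hypothetical sketch of LawDIS-style macro/micro control flow.
# All names are illustrative; the diffusion backbone is mocked out.
import numpy as np


def language_segment(image, prompt):
    """Macro mode (LS): mock of prompt-conditioned initial mask generation."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.float32)
    # Placeholder for a diffusion sample conditioned on the language prompt.
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 1.0
    return mask


def window_refine(image, mask, window):
    """Micro mode (WR): mock refinement restricted to a user-defined window."""
    y0, x0, y1, x1 = window
    refined = mask.copy()
    # Placeholder for re-running the generator on the cropped window only.
    refined[y0:y1, x0:x1] = np.clip(refined[y0:y1, x0:x1], 0.0, 1.0)
    return refined


def lawdis_pipeline(image, prompt=None, windows=(), init_mask=None):
    """Mode switcher: LS and WR can operate independently or jointly."""
    mask = language_segment(image, prompt) if prompt is not None else init_mask
    for win in windows:
        mask = window_refine(image, mask, win)
    return mask
```

Passing only `prompt` exercises the macro mode alone; passing `windows` with an existing `init_mask` exercises the micro mode alone; passing both chains them, mirroring the joint operation coordinated by the mode switcher.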
Problem

Research questions and friction points this paper is trying to address.

Enables precise object segmentation using language prompts
Integrates macro and micro control modes for refinement
Outperforms existing methods in image segmentation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-controlled segmentation for initial masks
Window-controlled refinement for user-defined regions
Macro-to-micro control modes for flexible operation
Xinyu Yan — Tianjin University
Meijun Sun — Tianjin University
Ge-Peng Ji — Australian National University
Fahad Shahbaz Khan — MBZUAI; Linköping University, Sweden
Salman Khan — MBZUAI
Deng-Ping Fan — Nankai Institute of Advanced Research (SHENZHEN FUTIAN)