🤖 AI Summary
Existing controllable image semantic understanding tasks rely on strong prompts (e.g., fine-grained text or precise masks), resulting in high interaction overhead and limited output diversity. This paper introduces a new task, “Collaborative Image Segmentation and Captioning” (SegCaptioning), which generates diverse (caption, mask) semantic pairs from a coarse prompt (e.g., an object bounding box), enabling flexible result selection by the user. To address the challenges of modeling user intent and aligning modalities under such coarse prompting, the authors propose a scene-graph-guided diffusion framework: a Prompt-Centric Scene Graph Adaptor explicitly models user intent, while a Scene Graph Guided Bimodal Transformer jointly generates captions and masks within the diffusion process. A Multi-Entities Contrastive Learning loss further enforces cross-modal semantic consistency. Evaluated on two benchmark datasets, the method achieves significant improvements in both segmentation and captioning with minimal prompt input, establishing new state-of-the-art performance and demonstrating strong generalization.
📝 Abstract
Controllable image semantic understanding tasks, such as captioning or segmentation, require users to provide a prompt (e.g., text or bounding boxes) to predict a unique outcome, which entails either costly prompt input or limited output information. This paper introduces a new task, “Image Collaborative Segmentation and Captioning” (SegCaptioning), which aims to translate a simple prompt, such as a bounding box around an object, into diverse semantic interpretations represented by (caption, mask) pairs, allowing flexible result selection by users. This task poses significant challenges: accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model (SGDiff) that leverages structured scene graph features for correlated mask-caption prediction. First, we introduce a Prompt-Centric Scene Graph Adaptor that maps a user's prompt to a scene graph, effectively capturing the user's intention. We then employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering the intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss that explicitly aligns visual and textual entities based on inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation with minimal prompt input.
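The paper itself provides no code, but the core idea of a multi-entity contrastive loss — pulling each visual entity (mask embedding) toward its matching textual entity (caption-word embedding) while pushing apart mismatched pairs via inter-modal similarity — can be sketched as a symmetric InfoNCE-style objective. The shapes, function name, temperature value, and the InfoNCE formulation below are all assumptions for illustration, not the authors' actual implementation:

```python
import numpy as np

def multi_entity_contrastive_loss(visual, textual, temperature=0.07):
    """Illustrative sketch (not the paper's loss): symmetric InfoNCE over
    N matched entity embeddings. `visual` and `textual` are (N, D) arrays
    where row i of each modality is assumed to describe the same entity."""
    # L2-normalize so dot products become cosine similarities.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature     # (N, N) inter-modal similarity matrix
    idx = np.arange(len(v))              # entity i should match entity i

    def cross_entropy(lg):
        # Row-wise softmax cross-entropy with the diagonal as targets.
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_p = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    # Average the visual->textual and textual->visual directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly matched embeddings the loss is near zero; permuting one modality (breaking the entity correspondence) drives it up, which is the behavior such a loss relies on to align mask and caption entities.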