🤖 AI Summary
Amodal segmentation infers the complete shape of occluded objects, but existing amodal methods cannot interact with users through text, while text-guided approaches such as LISA predict only visible object regions and struggle with complex occlusion scenarios. To address these limitations, the authors introduce the novel task of *amodal reasoning segmentation*: predicting the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. They develop a generalizable dataset generation pipeline and construct a new dataset of diverse daily-life occlusion scenarios. They further propose AURA (Amodal Understanding and Reasoning Assistant), a model with global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments on the proposed dataset validate AURA's effectiveness. The code, model, and dataset will be publicly released.
📝 Abstract
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex user intentions. While methods like LISA integrate multi-modal large language models (MLLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, which aims to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily-life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.