Integrating Multimodal Large Language Model Knowledge into Amodal Completion

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to image occlusion completion struggle to effectively incorporate real-world physical knowledge, often resulting in inaccurate reconstructions of occluded objects or people. This work proposes AmodalCG, a novel framework that explicitly integrates the commonsense reasoning capabilities of multimodal large language models (MLLMs) into the amodal completion process for the first time. AmodalCG employs an occlusion-aware mechanism to assess missing regions and selectively invokes the MLLM for semantic inference when needed, followed by iterative refinement using a visual generative model. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art techniques across diverse real-world scenarios, highlighting the effectiveness and potential of leveraging MLLMs for complex amodal completion tasks.
📝 Abstract
With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion and selectively invokes MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.
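The control flow described in the abstract (occlusion assessment → selective MLLM guidance → generation with iterative refinement) can be sketched as follows. This is an illustrative sketch only: the function names, the occlusion threshold, and the refinement stopping criterion are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the AmodalCG control flow. The helpers
# (mllm_guide, generator, scorer) are illustrative stand-ins.

def occlusion_ratio(visible_mask, estimated_full_mask):
    """Fraction of the estimated full object area that is hidden."""
    visible = sum(visible_mask)
    full = sum(estimated_full_mask)
    return 1.0 - visible / full if full else 0.0

def amodal_completion(image, visible_mask, estimated_full_mask,
                      mllm_guide, generator, scorer,
                      occlusion_threshold=0.5, max_refine_steps=3):
    # Step 1: assess occlusion; invoke the MLLM only when heavily occluded.
    guidance = None
    if occlusion_ratio(visible_mask, estimated_full_mask) > occlusion_threshold:
        # Step 2: MLLM reasons about (1) extent and (2) content of the
        # missing regions (assumed interface).
        guidance = mllm_guide(image, visible_mask)

    # Step 3: generate, then iteratively refine to correct imperfect
    # completions that may arise from inaccurate MLLM guidance.
    result = generator(image, visible_mask, guidance)
    for _ in range(max_refine_steps):
        if scorer(result) >= 0.9:  # assumed quality threshold
            break
        result = generator(result, visible_mask, guidance)
    return result
```

With binary masks, a target that is 75% hidden would exceed the 0.5 threshold and trigger the MLLM guidance branch, while a lightly occluded one would go straight to the generative model.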
Problem

Research questions and friction points this paper is trying to address.

amodal completion
Multimodal Large Language Models
occlusion
real-world knowledge
image reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

amodal completion
Multimodal Large Language Models
occlusion reasoning
visual generation
knowledge-guided refinement