CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses embodied reference understanding: localizing the object a person refers to through joint language instructions and pointing gestures. Existing methods model visual pointing only coarsely and over-rely on a single geometric cue (e.g., the head-to-fingertip line), which hinders referent disambiguation. To overcome this, we propose a dual-pointing-cue framework that jointly generates Gaussian ray heatmaps from both head-to-fingertip and wrist-to-fingertip vectors. A CLIP-aware fusion module dynamically weights and integrates multimodal features. Additionally, we introduce an object-center prediction auxiliary task and a CLIP-guided hybrid ensemble mechanism to strengthen cross-modal alignment. Evaluated on the YouRefIt dataset, our method improves mAP at the 0.25 IoU threshold by roughly 4 points, demonstrating significant gains in pointing perception accuracy and robustness.
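The Gaussian ray heatmap idea can be illustrated concretely: each pixel's activation decays with its perpendicular distance to the pointing ray, producing a soft supervisory signal along the cue direction. The sketch below is an illustrative reconstruction, not the paper's exact parameterization; the `sigma` value and the example keypoint coordinates are assumptions.

```python
import numpy as np

def gaussian_ray_heatmap(shape, start, tip, sigma=8.0):
    """Gaussian ray heatmap: pixel values decay with perpendicular
    distance to the ray from `start` through `tip`.
    Illustrative sketch only; the paper's exact formulation may differ."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).astype(np.float64)  # (h, w, 2) as (x, y)
    start = np.asarray(start, dtype=np.float64)
    tip = np.asarray(tip, dtype=np.float64)
    d = tip - start
    d /= np.linalg.norm(d)                   # unit pointing direction
    t = (pts - start) @ d                    # projection length along the ray
    t = np.maximum(t, 0.0)                   # ray starts at `start`, extends forward
    closest = start + t[..., None] * d       # nearest ray point per pixel
    dist = np.linalg.norm(pts - closest, axis=-1)
    return np.exp(-dist**2 / (2 * sigma**2))

# Two complementary cues from (hypothetical) keypoints, in (x, y) pixels:
hm_head = gaussian_ray_heatmap((480, 640), start=(300, 100), tip=(320, 200))
hm_wrist = gaussian_ray_heatmap((480, 640), start=(310, 170), tip=(320, 200))
```

In the dual-model setup, each model would receive one of these heatmaps as an extra input channel, encouraging it to attend to its own pointing cue.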

📝 Abstract
We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
Problem

Research questions and friction points this paper is trying to address.

Predicting referred objects using pointing and language
Integrating textual, visual, and scene context cues
Overcoming limitations of single-line pointing assumptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-model framework for head and wrist pointing
Gaussian ray heatmap for pointing cues
CLIP-Aware hybrid ensemble for multimodal fusion
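The hybrid ensemble can be sketched as a selection rule over the two models' predictions: when CLIP image-text similarity clearly favors one predicted box's crop, that box wins; otherwise the decision falls back on detector confidence. This is a minimal sketch of the idea only; the function name, the 0.05 margin, and the fallback rule are assumptions, not the paper's CAPE module.

```python
def clip_aware_ensemble(box_head, box_wrist,
                        sim_head, sim_wrist,
                        conf_head, conf_wrist,
                        margin=0.05):
    """Hypothetical hybrid ensemble over the two pointing models.

    sim_*  : CLIP similarity between the text query and each box's crop
    conf_* : each model's own detection confidence
    If CLIP clearly prefers one crop (difference > margin), trust CLIP;
    otherwise fall back on detector confidence."""
    if abs(sim_head - sim_wrist) > margin:
        return box_head if sim_head > sim_wrist else box_wrist
    return box_head if conf_head >= conf_wrist else box_wrist

# Example: CLIP clearly prefers the head-ray model's box.
box = clip_aware_ensemble((10, 10, 50, 50), (12, 8, 52, 48),
                          sim_head=0.31, sim_wrist=0.22,
                          conf_head=0.7, conf_wrist=0.9)
```

In practice the similarities would come from a CLIP model scoring the text instruction against each predicted box's image crop; a soft weighted fusion of the boxes is an equally plausible variant.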