Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

📅 2025-08-05

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses zero-shot, cross-scene localization of 3D grasp points for robots driven by natural language. To reduce the 3D localization errors caused by ambiguity in 2D images and uncertainty in language semantics, we propose a context-aware method for localizing 3D action points. The approach uses lightweight 2D point-level guidance, aggregated across multiple views of a reconstructed scene, to produce spatially grounded manipulation responses. Its central component is a task-specific 3D relevancy field that bypasses high-dimensional feature mapping, which improves robustness to occlusion and ambiguous language. The full pipeline (capture, MLLM querying, 3D reconstruction, and grasp pose extraction) runs in under 20 seconds and produces highly localized 3D action points suitable for practical manipulation tasks.


📝 Abstract

We propose Point2Act, which directly retrieves the 3D action point relevant for a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility of generalist robots that can perform zero-shot tasks following natural language descriptions within an unseen environment. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features yield blurry 2D regions and struggle to find precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently imbue lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized and captures fine-grained 3D spatial context that transfers directly to an explicit position for physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/

Problem

Research questions and friction points this paper is trying to address.

Retrieve precise 3D action points for contextually described tasks

Overcome blurry 2D regions for accurate 3D localization

Enable zero-shot task execution in unseen environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Multimodal LLMs for 3D action retrieval

Uses 3D relevancy fields for precise localization

Multi-view aggregation handles occlusion ambiguities
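
The idea of aggregating 2D point guidance across views can be illustrated with a small sketch. This is not the paper's implementation: it assumes pinhole cameras with known poses, one 2D guidance point per view (standing in for an MLLM's answer), and a Gaussian scoring kernel with a hypothetical bandwidth `sigma`. The names `project`, `relevancy`, and `action_point` are made up for illustration. Each candidate 3D point is projected into every view, scored by its pixel distance to that view's guidance point, and the scores are averaged.

```python
import math

def project(point, cam):
    """Project a world-space 3D point with a simple pinhole camera.

    cam is a dict with a 3x3 rotation "R", translation "t", focal length "f",
    and principal point ("cx", "cy"). Returns (u, v), or None if the point
    lies behind the camera.
    """
    R, t = cam["R"], cam["t"]
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 1e-9:
        return None
    return (cam["f"] * x / z + cam["cx"], cam["f"] * y / z + cam["cy"])

def relevancy(points, views, sigma=10.0):
    """Score each 3D point by averaging, over the views that see it, a Gaussian
    of the pixel distance to that view's 2D guidance point (assumed kernel)."""
    scores = []
    for p in points:
        total, seen = 0.0, 0
        for cam, (gu, gv) in views:
            uv = project(p, cam)
            if uv is None:
                continue  # point is not visible in this view
            d2 = (uv[0] - gu) ** 2 + (uv[1] - gv) ** 2
            total += math.exp(-d2 / (2.0 * sigma ** 2))
            seen += 1
        scores.append(total / seen if seen else 0.0)
    return scores

def action_point(points, views, sigma=10.0):
    """Return the highest-scoring 3D point and its score."""
    scores = relevancy(points, views, sigma)
    best = max(range(len(points)), key=scores.__getitem__)
    return points[best], scores[best]

# Usage: two cameras one unit apart along x, both looking down +z.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_a = {"R": I3, "t": [0.0, 0.0, 0.0], "f": 100.0, "cx": 50.0, "cy": 50.0}
cam_b = {"R": I3, "t": [-1.0, 0.0, 0.0], "f": 100.0, "cx": 50.0, "cy": 50.0}
candidates = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (0.0, 0.0, 4.0)]
views = [(cam_a, (50.0, 50.0)), (cam_b, (0.0, 50.0))]
best_point, best_score = action_point(candidates, views)
```

With only `cam_a`, the first and third candidates lie on the same viewing ray and receive identical scores; adding `cam_b` separates them, which is the kind of depth ambiguity a second view resolves in this toy setup.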
