🤖 AI Summary
This work addresses the limited precision of localizing interaction regions in open-vocabulary 3D point clouds. The limitation arises because the special output tokens of multimodal large language models are generated autoregressively and therefore carry rich semantics but little spatial structure. To overcome this, the authors propose a Voxel-enhanced Affordance detection framework (VoxAfford) that, for the first time, integrates multi-scale voxelized geometric features extracted by a pretrained 3D VQVAE with these language-guided output tokens. Cross-attention aligns each semantic query with geometric patterns at its paired voxel scale, while a learnable gate modulates how strongly the geometric features are injected, yielding spatially aware segmentation masks that generalize to novel affordance descriptions. With the VQVAE encoder frozen, the method achieves state-of-the-art performance on open-vocabulary 3D affordance detection, improving mIoU by approximately 8%, and transfers zero-shot to real-world robotic manipulation of novel objects.
📝 Abstract
Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pretrained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real-robot experiments confirm zero-shot transfer to novel objects.
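The gated injection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the scalar sigmoid gate computed from the concatenated token and retrieved feature, the residual addition, and every name and dimension (`gated_cross_attention`, `Wq`, `Wk`, `Wv`, `w_gate`, `b_gate`) are assumptions chosen for illustration, and random arrays stand in for the frozen VQVAE features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(token, voxel_feats, Wq, Wk, Wv, w_gate, b_gate=0.0):
    """Hypothetical sketch: one output token queries one voxel scale.

    token:       (d,)     output-token embedding (the semantic query)
    voxel_feats: (n, d_v) features of n voxels at the paired scale
    Returns the enhanced token, the attention weights, and the gate value.
    """
    q = token @ Wq                                  # (d_k,)
    k = voxel_feats @ Wk                            # (n, d_k)
    v = voxel_feats @ Wv                            # (n, d)
    attn = softmax(k @ q / np.sqrt(q.shape[-1]))    # (n,) weights over voxels
    retrieved = attn @ v                            # (d,) retrieved geometry
    # Assumed gate form: a scalar in (0, 1) scoring token/geometry compatibility.
    z = np.concatenate([token, retrieved]) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))
    # Assumed residual injection, scaled by the gate.
    return token + gate * retrieved, attn, gate

# Toy usage with random stand-ins for learned weights and frozen features.
rng = np.random.default_rng(0)
d, d_v, d_k, n = 8, 6, 4, 27                        # illustrative sizes only
token = rng.normal(size=d)
voxel_feats = rng.normal(size=(n, d_v))
Wq = rng.normal(size=(d, d_k))
Wk = rng.normal(size=(d_v, d_k))
Wv = rng.normal(size=(d_v, d))
w_gate = rng.normal(size=2 * d)

enhanced, attn, gate = gated_cross_attention(token, voxel_feats, Wq, Wk, Wv, w_gate)
```

In this sketch, a gate near zero leaves the token essentially unchanged, while a gate near one adds the full retrieved geometric feature. In a multi-scale setting, the same function would be applied once per (token, voxel scale) pair before the enhanced tokens are aggregated.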