VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

📅 2026-05-02

🤖 AI Summary
This work addresses the limited precision with which open-vocabulary 3D affordance detection localizes interaction regions on point clouds, a challenge arising because autoregressively generated output tokens are semantically rich but carry little spatial structure. To overcome this, the authors propose the Voxel-enhanced Affordance detection framework (VoxAfford), which, for the first time, integrates multi-scale voxelized geometric features extracted by a pretrained 3D VQVAE with the language-guided output tokens. Cross-attention aligns semantic queries with geometric patterns, while a learnable gate dynamically modulates the fusion strength between modalities, yielding spatially aware and highly generalizable segmentation masks. With the VQVAE encoder frozen, the method achieves state-of-the-art performance on open-vocabulary 3D affordance detection, improving mIoU by approximately 8%, and demonstrates successful zero-shot transfer to real-world robotic manipulation of novel objects.
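For readers unfamiliar with the metric, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed for binary per-point masks. The function names `iou` and `mean_iou` are illustrative, and the averaging unit (per shape, per affordance category, or per instruction) is an assumption here, since the summary does not state the evaluation protocol.

```python
def iou(pred, target):
    """Intersection-over-union of two binary per-point masks (lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (predicted_mask, ground_truth_mask) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy example: four points, prediction and ground truth overlap on one point.
example_score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

In practice the predicted mask would first be obtained by thresholding per-point scores; the threshold is likewise not given in the text above.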
📝 Abstract
Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.
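The gated retrieval step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes identity query/key/value projections, a single attention head, a scalar sigmoid gate computed from the concatenated token and retrieved feature, and a residual addition. The function name `gated_cross_attention` and the parameters `gate_w` and `gate_b` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(token, voxel_feats, gate_w, gate_b=0.0):
    """Fuse one output token with voxel features from its paired scale.

    token:       list[float] of length d, used as the semantic query
    voxel_feats: list of list[float] of length d, used as keys and values
    gate_w:      list[float] of length 2*d, hypothetical gate weights
    gate_b:      float, hypothetical gate bias
    """
    d = len(token)
    # Scaled dot-product attention: the token queries the voxel features.
    scores = [dot(token, v) / math.sqrt(d) for v in voxel_feats]
    weights = softmax(scores)
    retrieved = [
        sum(w * v[i] for w, v in zip(weights, voxel_feats)) for i in range(d)
    ]
    # Assumed gate form: a scalar in (0, 1) from token and retrieved geometry.
    gate_logit = dot(gate_w, token + retrieved) + gate_b
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    # Residual injection: the gate scales how much geometry enters the token.
    fused = [t + gate * r for t, r in zip(token, retrieved)]
    return fused, gate, weights


# Toy usage: one 2-D token attending over two voxel features.
token = [1.0, 0.0]
voxels = [[1.0, 0.0], [0.0, 1.0]]
fused, gate, weights = gated_cross_attention(token, voxels, gate_w=[0.0] * 4)
```

With a strongly negative `gate_b` the gate closes and the token passes through almost unchanged, which is one way such a gate can suppress geometric features that are incompatible with the token's semantics.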
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary
3D affordance detection
point clouds
spatial localization
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale voxel-token fusion
open-vocabulary 3D affordance detection
cross-attention retrieval
semantic-conditioned attention
zero-shot transfer
Haowen Sun
Department of Automation, Tsinghua University
Computer Vision
Shaolong Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049
Mingyang Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049
Chengzhong Ma
Unknown affiliation
Xinzhe Chen
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049
Qiongjie Cui
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049
Xingyu Chen
PhD Candidate, University of Technology Sydney, Australian National University
Spatial Audio, HRTF
Zeyang Liu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049
Xuguang Lan
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049