🤖 AI Summary
This work addresses text-guided affordance grounding in RGB-D images, where missing observations hinder accurate prediction. The authors propose Affostruction, a generative framework that first reconstructs complete object geometry from partial observations and then grounds affordances over the full shape, including unobserved regions. Key innovations include generative multi-view reconstruction via sparse voxel fusion, flow-based modeling of affordance distributions to capture their inherent ambiguity, and an affordance-driven active view selection strategy. Experiments show significant improvements over existing methods: 19.1 aIoU for affordance grounding (a 40.4% relative gain) and 32.67 IoU for 3D reconstruction (a 67.7% relative improvement).
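Because the affordance regions matching a query are often multi-modal (e.g., several equally valid graspable spots), a flow-based model samples affordance fields rather than regressing a single mask. Below is a minimal, hypothetical sketch of how such sampling could look with a flow-matching-style ODE; `velocity_net`, the scalar per-point parameterization, and the token conditioning are illustrative assumptions, not the paper's actual architecture.

```python
import torch

def sample_affordance(velocity_net, shape_tokens, n_points, steps=32):
    """Draw one affordance sample by integrating a learned flow from noise.

    velocity_net(x, t, shape_tokens) -> dx/dt is a hypothetical network;
    different noise draws yield different plausible affordance fields,
    which is how ambiguity in the distribution can be captured.
    """
    x = torch.randn(n_points, 1)        # x_0 ~ N(0, I), one scalar per point
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n_points, 1), i * dt)
        x = x + dt * velocity_net(x, t, shape_tokens)  # forward Euler step
    return torch.sigmoid(x)             # squash to per-point probabilities

# Toy usage with a stand-in velocity field (illustration only).
toy_velocity = lambda x, t, tokens: -x + tokens.mean()
probs = sample_affordance(toy_velocity, torch.randn(64, 16), n_points=1024)
print(probs.shape)  # torch.Size([1024, 1])
```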
📝 Abstract
This paper addresses the problem of affordance grounding from RGB-D images of an object, which aims to localize the surface regions corresponding to a text query describing an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape, including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures the inherent ambiguity of affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (a 40.4% improvement) and 32.67 IoU for 3D reconstruction (a 67.7% improvement), enabling accurate affordance prediction on complete shapes.
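To make affordance-driven active view selection concrete, here is a small self-contained sketch of one plausible scoring rule: rank candidate viewpoints by how much uncertain affordance mass they would reveal. The per-voxel visibility masks, the entropy weighting, and the scoring function are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the model's outputs (hypothetical, for illustration):
# per-voxel affordance probabilities on the reconstructed shape, and a
# 0/1 mask per candidate view of which voxels that camera would observe.
N_VOXELS, N_VIEWS = 1024, 16
affordance = rng.uniform(0.0, 1.0, N_VOXELS)
visibility = rng.integers(0, 2, (N_VIEWS, N_VOXELS)).astype(float)

def view_score(vis_mask, aff, eps=1e-8):
    """Score a view by the uncertain affordance mass it would observe."""
    entropy = -(aff * np.log(aff + eps) + (1 - aff) * np.log(1 - aff + eps))
    return float((vis_mask * aff * entropy).sum())

scores = [view_score(visibility[v], affordance) for v in range(N_VIEWS)]
best = int(np.argmax(scores))
print(f"next view to capture: {best} (score {scores[best]:.2f})")
```

Under this rule, a view that exposes voxels with high predicted affordance but high uncertainty wins, steering the next capture toward the regions that matter most for grounding.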