Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language-action (VLA) models often fall into a "memory trap" under out-of-distribution scenarios, reproducing previously observed trajectories instead of adapting to novel environments, because their end-to-end architectures lack explicit 3D spatial reasoning. Method: The paper proposes Affordance Field Intervention (AFI), a lightweight hybrid framework that introduces 3D Spatial Affordance Fields (SAFs), geometrically grounded representations of where interaction is physically feasible, into VLA decision-making. AFI detects memory traps in real time from proprioception, repositions the robot toward high-affordance regions, proposes affordance-driven waypoints that anchor VLA-generated actions, and scores candidate trajectories by cumulative affordance. Contribution/Results: AFI combines end-to-end action generation with geometric awareness while requiring no additional annotations or model retraining. Evaluated on a real robotic platform and the LIBERO-Pro benchmark, it improves task success rates by 23.5% and 20.2%, respectively, significantly enhancing VLA robustness and generalization to environmental changes.

📝 Abstract
Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the "Memory Trap". This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones ($\pi_0$ and $\pi_{0.5}$) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.
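The SAF-based scorer described above can be illustrated with a minimal sketch: candidate trajectories from the VLA are scored by summing the affordance field values at their waypoints, and the highest-scoring one is selected. The voxel-grid discretization, `origin`/`resolution` parameters, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cumulative_affordance(trajectory, saf, origin, resolution):
    """Sum affordance values at each waypoint of a trajectory.

    trajectory: (N, 3) array of Cartesian waypoints.
    saf: 3D voxel grid of affordance values (hypothetical representation).
    origin, resolution: map Cartesian coordinates to voxel indices.
    """
    idx = np.floor((trajectory - origin) / resolution).astype(int)
    idx = np.clip(idx, 0, np.array(saf.shape) - 1)  # stay inside the grid
    return float(saf[idx[:, 0], idx[:, 1], idx[:, 2]].sum())

def select_trajectory(candidates, saf, origin, resolution):
    """Pick the VLA-generated candidate with the highest cumulative affordance."""
    scores = [cumulative_affordance(t, saf, origin, resolution) for t in candidates]
    return candidates[int(np.argmax(scores))], max(scores)
```

A candidate that passes through high-affordance voxels accumulates a larger score, so the selection step steers the policy toward physically actionable regions without retraining the VLA.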
Problem

Research questions and friction points this paper is trying to address.

VLAs reproduce memorized trajectories instead of adapting under distribution shift (the "memory trap")
End-to-end VLA architectures lack explicit 3D spatial reasoning for identifying actionable regions
VLA robustness and generalization degrade in out-of-distribution real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Spatial Affordance Fields as plug-in guidance
Detects memory traps via proprioception and repositions robot
Selects trajectories with highest cumulative affordance score
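The proprioceptive memory-trap detection above could plausibly be realized as a stall check on the end-effector state: if motion over a sliding window falls below a threshold, the policy is likely replaying a non-progressing trajectory. The window size, threshold, and class design below are guesses for illustration; the paper's exact detection criterion is not given in this summary.

```python
from collections import deque
import numpy as np

class MemoryTrapDetector:
    """Flag a memory trap when end-effector displacement over a sliding
    window of proprioceptive readings falls below a threshold
    (hypothetical sketch of 'detection through proprioception')."""

    def __init__(self, window=20, min_progress=0.01):
        self.poses = deque(maxlen=window)
        self.min_progress = min_progress

    def update(self, ee_pos):
        """Add the latest end-effector position; return True if stalled."""
        self.poses.append(np.asarray(ee_pos, dtype=float))
        if len(self.poses) < self.poses.maxlen:
            return False  # not enough history yet
        # Net displacement across the window; near-zero suggests the VLA
        # is repeating a memorized, non-progressing trajectory.
        progress = np.linalg.norm(self.poses[-1] - self.poses[0])
        return bool(progress < self.min_progress)
```

When the detector fires, the framework would intervene, repositioning toward high-affordance regions rather than letting the VLA continue its memorized rollout.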
Siyu Xu
University of Sydney
Robotics, Computer Vision, Machine Learning
Zijian Wang
School of Computer Science, The University of Sydney
Yunke Wang
University of Sydney
Generative model, robotics, imitation learning, reinforcement learning
Chenghao Xia
School of Computer Science, The University of Sydney
Tao Huang
John Hopcroft Center for Computer Science, Shanghai Jiao Tong University
Chang Xu
School of Computer Science, The University of Sydney