Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing affordance grounding models lack chain-of-thought (CoT) reasoning capabilities, hindering their ability to model affordances shared across objects and limiting cross-domain generalization and explicit reasoning. To address this, we propose Affordance-R1, the first unified affordance grounding framework to integrate cognitive-chain reasoning with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, enabling zero-shot generalization and emergent test-time reasoning without explicit reasoning annotations. Our approach builds on multimodal large language models and introduces a multi-dimensional reward function that jointly optimizes format compliance, perceptual grounding, and cognitive reasoning. We further construct ReasonAff, a new benchmark dataset for affordance reasoning. Experiments demonstrate substantial improvements over state-of-the-art methods across multiple benchmarks, with robust zero-shot transfer and open-world generalization. The code and dataset are publicly released.

📝 Abstract
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect affordances shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive-CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance reward function, which combines format, perception, and cognition rewards to effectively guide optimization. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset are released at https://github.com/hq-King/Affordance-R1.
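The GRPO training the abstract describes can be sketched at a high level: for each prompt, a group of candidate responses is sampled and scored by the reward function, and each response's advantage is its reward normalized by the group's mean and standard deviation, with no learned value critic. The sketch below assumes plain scalar rewards and is an illustration of the general GRPO idea, not the paper's implementation.

```python
# Illustrative sketch of GRPO's group-relative advantage computation.
# Each response in a sampled group gets a scalar reward; advantages are
# the rewards standardized within the group (mean 0, unit variance).
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a zero std when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: the best-scoring response gets the largest positive advantage,
# the worst the most negative, and mid-group responses sit near zero.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

These advantages then weight the policy-gradient update for each response's tokens, so responses that beat their group average are reinforced and the rest are suppressed.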
Problem

Research questions and friction points this paper is trying to address.

Enhancing affordance grounding for robot action regions
Improving generalization via Chain-of-Thought reasoning
Integrating reinforcement learning with cognitive rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates GRPO-based RL for affordance reasoning
Uses format, perception, cognition rewards
Constructs ReasonAff dataset for training
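The three reward terms listed above could be combined along the following lines. This is a hypothetical sketch in the spirit of the paper's description: the tag template, box format, helper names, and weights are all assumptions for illustration, not the paper's actual reward implementation.

```python
import re

def format_reward(text):
    # 1.0 if the response follows an assumed <think>...</think><answer>...</answer>
    # template, 0.0 otherwise; (?s) lets '.' match newlines in the trace.
    pattern = r"(?s)<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, text.strip()) else 0.0

def perception_reward(pred_box, gt_box):
    # IoU between predicted and ground-truth affordance regions,
    # each given as (x1, y1, x2, y2).
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def total_reward(text, pred_box, gt_box, cognition, w=(0.2, 0.5, 0.3)):
    # Weighted sum of the three terms; `cognition` stands in for an external
    # score in [0, 1] on the reasoning trace (assumed, not specified here).
    return (w[0] * format_reward(text)
            + w[1] * perception_reward(pred_box, gt_box)
            + w[2] * cognition)
```

A response with a well-formed trace, a perfectly localized region, and a top cognition score would then receive the maximum combined reward under these assumed weights.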
👥 Authors
Hanqing Wang, The Hong Kong University of Science and Technology (GZ)
Shaoyang Wang, National University of Singapore
Yiming Zhong, ShanghaiTech University (Embodied AI, Machine Learning)
Zemin Yang, Master's student, ShanghaiTech University (Computer Vision, Embodied AI)
Jiamin Wang, Virginia Tech (Control, Dynamics)
Zhiqing Cui, Nanjing University of Information Science & Technology
Jiahao Yuan, East China Normal University
Yifan Han, Institute of Automation, Chinese Academy of Sciences
Mingyu Liu, Technical University of Munich (Computer Vision, Deep Learning)
Yuexin Ma, Assistant Professor, School of Information Science and Technology, ShanghaiTech University (computer vision, embodied AI, autonomous driving)