🤖 AI Summary
Existing embodied agents typically execute instructions without checking whether the target objects can actually be manipulated, so they cannot perceive or reason about implicit affordance constraints. To address this limitation, this work introduces DynAfford, a benchmark that evaluates agents in dynamic environments where object affordances may change over time and are not specified in the instruction. The authors also propose ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments show that ADAPT significantly improves robustness and task success in both seen and unseen environments, and that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms GPT-4o.
📝 Abstract
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on executing instructions directly, without checking whether the target objects can actually be manipulated; that is, they fail to assess which affordances are currently available. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
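To make the plug-and-play idea concrete, the following is a minimal Python sketch of how an affordance check might wrap an existing planner's output. It is an illustration under stated assumptions, not the ADAPT implementation: `Step`, `Remedy`, `rule_based_backend`, and `plan_with_affordances` are hypothetical names, and the hand-written rule is a toy stand-in for the learned affordance inference backend described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Step:
    action: str  # e.g. "pickup"
    target: str  # e.g. "mug"


@dataclass(frozen=True)
class Remedy:
    step: Step            # action that restores the missing precondition
    resulting_state: str  # state of the target once the remedy has run


# Hypothetical backend interface: return None when the step is afforded in the
# current object states, otherwise a Remedy to run first.
AffordanceBackend = Callable[[Step, Dict[str, str]], Optional[Remedy]]


def rule_based_backend(step: Step, states: Dict[str, str]) -> Optional[Remedy]:
    """Toy stand-in for a learned affordance model (assumption, not ADAPT)."""
    if step.action == "fill" and states.get(step.target) == "dirty":
        return Remedy(Step("clean", step.target), "clean")
    return None


def plan_with_affordances(
    base_plan: List[Step],
    states: Dict[str, str],
    backend: AffordanceBackend,
) -> List[Step]:
    """Check each planned step and insert a remedial step when it is not afforded."""
    states = dict(states)  # work on a copy so the caller's observation is untouched
    checked: List[Step] = []
    for step in base_plan:
        remedy = backend(step, states)
        if remedy is not None:
            checked.append(remedy.step)
            states[remedy.step.target] = remedy.resulting_state
        checked.append(step)
    return checked
```

Given a base plan of `pickup mug`, `fill mug` and an observed state in which the mug is dirty, this wrapper inserts a `clean mug` step before `fill mug`. The base planner itself is left unchanged, which is the sense in which such a module can be called plug-and-play.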