BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

📅 2026-04-25

🤖 AI Summary

This work proposes BridgeACT, a framework that learns real-world robotic manipulation directly from human demonstration videos, without requiring any robot demonstration data. The core idea is to use affordance as an embodiment-agnostic intermediate representation that decouples manipulation into two subproblems: “where to grasp” and “how to move.” The approach first grounds task-relevant affordance regions in the current scene, then predicts task-conditioned 3D motion affordances, and finally maps both to executable robot actions through a grasping module and a lightweight closed-loop motion controller. Complex tasks are represented as compositions of affordance operations, which gives a unified treatment of diverse tasks and object-to-object interactions. In real-world experiments, the method outperforms prior baselines and generalizes to unseen objects, scenes, and camera viewpoints.

📝 Abstract

Learning robot manipulation from human videos is appealing due to the scale and diversity of human demonstrations, but transferring such demonstrations to executable robot behavior remains challenging. Prior work either relies on robot data for downstream adaptation or learns affordance representations that remain at the perception level and do not directly support real-world execution. We present BridgeACT, an affordance-driven framework that learns robotic manipulation directly from human videos without requiring any robot demonstration data. Our key idea is to model affordance as an embodiment-agnostic intermediate representation that bridges human demonstrations and robot actions. BridgeACT decomposes manipulation into two complementary problems: where to grasp and how to move. To this end, BridgeACT first grounds task-relevant affordance regions in the current scene, and then predicts task-conditioned 3D motion affordances from human demonstrations. The resulting affordances are mapped to robot actions through a grasping module and a lightweight closed-loop motion controller, enabling direct deployment on real robots. In addition, we represent complex manipulation tasks as compositions of affordance operations, which allows a unified treatment of diverse tasks and object-to-object interactions. Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints.
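
The abstract does not specify how the lightweight closed-loop motion controller works, so the following is only an illustrative sketch of one common choice: a clipped proportional controller that tracks a sequence of predicted 3D waypoints. All names and parameters here (`step_toward`, `follow_waypoints`, `gain`, `max_step`, `tol`) are hypothetical and are not taken from the paper. The sketch also assumes the robot reaches each commanded position exactly, whereas a real loop would re-read the end-effector pose from sensors at every iteration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def step_toward(current: Point, target: Point,
                gain: float = 0.5, max_step: float = 0.02) -> Point:
    """Take one proportional step toward target, clipped to max_step (metres)."""
    delta = [t - c for c, t in zip(current, target)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist == 0.0:
        return current
    # Proportional command, limited so a single step never exceeds max_step.
    step = min(gain * dist, max_step)
    scale = step / dist
    return tuple(c + d * scale for c, d in zip(current, delta))


def follow_waypoints(start: Point, waypoints: Sequence[Point],
                     tol: float = 0.005, max_iters: int = 1000) -> List[Point]:
    """Track each waypoint in order and return the visited positions."""
    pos = start
    trace = [pos]
    for wp in waypoints:
        for _ in range(max_iters):
            if math.dist(pos, wp) <= tol:
                break  # close enough, move on to the next waypoint
            # Assumption: the commanded position is reached exactly.
            # On hardware, pos would come from the measured robot state.
            pos = step_toward(pos, wp)
            trace.append(pos)
    return trace


if __name__ == "__main__":
    # Hypothetical "how to move" output: lift 10 cm, then slide 10 cm sideways.
    path = follow_waypoints((0.0, 0.0, 0.0), [(0.0, 0.0, 0.1), (0.1, 0.0, 0.1)])
    print(len(path), path[-1])
```

Clipping the step size keeps each commanded motion small, so the loop can correct for errors between iterations instead of committing to a long open-loop move.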

Problem

Research questions and friction points this paper is trying to address.

robot manipulation

human demonstrations

affordance learning

vision-to-action transfer

zero-shot robot execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

affordance

human-to-robot transfer

embodiment-agnostic representation