Retrieval-Augmented Robots via Retrieve-Reason-Act

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing robotic systems struggle to acquire and execute previously unseen complex assembly tasks from unstructured visual documentation in zero-shot settings. This work proposes the Retrieval-Augmented Robotics (RAR) paradigm, which enables embodied agents to actively retrieve visual assembly manuals through a Retrieve-Reason-Act loop, align 2D illustrations with 3D physical objects, and generate executable action plans. RAR is presented as the first approach to use information retrieval as a driver of physical action by embodied agents, marking a shift from passive execution to active acquisition of external procedural knowledge. Experiments show that RAR significantly outperforms baselines relying on zero-shot reasoning or few-shot examples on long-horizon assembly tasks, validating visual document retrieval for zero-shot task planning.

📝 Abstract
To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings, where no prior demonstrations exist, robots face a critical information gap (for example, the exact sequence required to assemble a complex furniture kit) that cannot be filled by internal parametric knowledge (common sense) or past memory. While recent robotic works attempt to search before acting, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define this paradigm as Retrieval-Augmented Robotics (RAR), which endows the robot with the information-seeking capability to bridge the gap between visual documentation and physical actuation. We formulate task execution as an iterative Retrieve-Reason-Act loop: the embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical action.
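The Retrieve-Reason-Act loop described in the abstract can be sketched in heavily simplified form as below. This is an illustrative assumption, not the paper's implementation: the keyword-overlap retriever, the toy corpus fields (`keywords`, `parts`), and the `attach` primitive all stand in for the paper's visual retrieval, cross-modal grounding, and motion synthesis components.

```python
# Minimal sketch of a Retrieve-Reason-Act loop (illustrative only).
# Real RAR retrieves visual manual pages and grounds 2D diagrams to
# 3D parts; here both are replaced by toy keyword/part matching.

def retrieve(query_terms, corpus):
    """Rank manual pages by keyword overlap with the current query."""
    return max(corpus, key=lambda page: len(query_terms & page["keywords"]))

def reason(page, scene_parts):
    """Ground the parts referenced by a page to parts observed in the scene."""
    return [part for part in page["parts"] if part in scene_parts]

def act(grounded_parts):
    """Emit executable primitives for the grounded parts."""
    return [("attach", part) for part in grounded_parts]

def retrieve_reason_act(goal_terms, corpus, scene_parts, max_steps=10):
    """Iterate retrieval, grounding, and action synthesis until the
    goal terms are covered or the step budget runs out."""
    plan, remaining = [], set(goal_terms)
    for _ in range(max_steps):
        if not remaining:
            break
        page = retrieve(remaining, corpus)      # Retrieve
        grounded = reason(page, scene_parts)    # Reason (ground to scene)
        plan.extend(act(grounded))              # Act
        remaining -= page["keywords"]
    return plan

# Toy corpus of two "manual pages" and an observed scene.
corpus = [
    {"keywords": {"leg", "base"}, "parts": ["leg", "base"]},
    {"keywords": {"shelf"}, "parts": ["shelf"]},
]
plan = retrieve_reason_act({"leg", "shelf"}, corpus, {"leg", "base", "shelf"})
```

Each iteration narrows the remaining information need, so the loop terminates once the retrieved pages jointly cover the goal; in the real system this role would be played by task-progress tracking rather than keyword bookkeeping.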
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Robotics
zero-shot task execution
procedural knowledge retrieval
visual documentation grounding
embodied information seeking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Robotics
zero-shot task execution
visual procedural retrieval
cross-modal alignment
embodied information seeking