How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

📅 2026-05-16

🤖 AI Summary
This work addresses the scalability bottleneck in robot policy learning caused by the high cost of demonstration data. It proposes DeMiAn (Dense Multi-aspect Annotation), a framework that treats language density as a scalable training signal. Vision-language models re-label existing videos with dense annotations along four aspects (physical motion, scene composition, arm pose, and reasoning), and at deployment an asynchronous instruction generator selects the annotation suited to the task and scene. Without additional demonstrations, DeMiAn raises success on RoboCasa by 5 points over a task-only baseline, coming within 3 points of a per-task oracle. It also improves composite-task execution and out-of-distribution generalization, and offers a favorable compute-performance trade-off.
📝 Abstract
Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.
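The asynchronous deployment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `generate_instruction`, `policy_step`, `run_episode`, the simulated latencies, and the placeholder aspect-selection rule are all hypothetical, and the fallback to the plain task description while the instructor is still running is an assumption made only for illustration.

```python
import asyncio
from dataclasses import dataclass

# The four annotation aspects named in the abstract.
ASPECTS = ("motion", "scene", "pose", "reasoning")


@dataclass
class Instruction:
    text: str
    aspect: str  # "task" marks the plain task-only fallback


async def generate_instruction(task: str, scene_snapshot: str) -> Instruction:
    """Hypothetical stand-in for the learned instructor."""
    await asyncio.sleep(0.05)  # simulated generation latency
    # Placeholder selection rule, NOT the paper's learned mapping.
    aspect = ASPECTS[len(task) % len(ASPECTS)]
    return Instruction(text=f"[{aspect}] {task}", aspect=aspect)


async def policy_step(observation: int, instruction: Instruction) -> str:
    """Hypothetical policy step; returns which aspect conditioned it."""
    await asyncio.sleep(0.01)  # simulated control-step duration
    return instruction.aspect


async def run_episode(task: str, num_steps: int = 20) -> list[str]:
    scene_snapshot = "initial-frame"  # placeholder for the initial image
    # Start the instructor in the background so its latency overlaps
    # with policy execution instead of delaying the first action.
    pending = asyncio.create_task(generate_instruction(task, scene_snapshot))
    # Assumption: act on the task description until the dense one arrives.
    instruction = Instruction(text=task, aspect="task")
    used = []
    for step in range(num_steps):
        if instruction.aspect == "task" and pending.done():
            instruction = pending.result()  # switch to the dense annotation
        used.append(await policy_step(step, instruction))
    if not pending.done():
        pending.cancel()  # episode ended before the instructor finished
    return used


if __name__ == "__main__":
    print(asyncio.run(run_episode("open the drawer")))
```

In this sketch the first control steps run on the task-only instruction and later steps run on the generated one, which is one way generation latency could be hidden behind execution.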
Problem

Research questions and friction points this paper is trying to address.

robot policy learning
language annotations
demonstration reuse
dense annotation
scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense language annotation
robot policy learning
vision-language-action policy
multi-aspect instruction
demonstration re-labeling