🤖 AI Summary
This work addresses the scalability bottleneck that costly demonstration data imposes on robot policy learning. It proposes DeMiAn, a framework that treats language density as a scalable signal for policy learning. Vision-language models generate dense, multi-aspect language annotations (covering motion, scene, arm pose, and reasoning) from existing videos, and at deployment an asynchronous instruction generator selects guidance suited to the task and scene. Without additional demonstrations, DeMiAn raises the RoboCasa success rate by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. It also improves composite-task execution and out-of-distribution generalization, and offers a favorable compute-performance trade-off.
📝 Abstract
Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.
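The asynchronous deployment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `generate_annotation`, `policy_step`, and `run_episode` are hypothetical stand-ins, the latencies are simulated with `time.sleep`, and the fallback to the plain task description until the annotation is ready is an assumption made for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_annotation(task, scene):
    """Hypothetical stand-in for the learned instructor (slow call)."""
    time.sleep(0.05)  # simulated generation latency
    return f"dense annotation for: {task}"


def policy_step(instruction, step):
    """Hypothetical stand-in for one policy action; records what it was conditioned on."""
    return (step, instruction)


def run_episode(task, scene, num_steps=20, step_time=0.01):
    """Run the policy while the instructor generates in a background thread."""
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start generation without blocking the control loop.
        future = pool.submit(generate_annotation, task, scene)
        # Assumption: condition on the plain task text until the annotation arrives.
        instruction = task
        for step in range(num_steps):
            if future.done():
                instruction = future.result()
            log.append(policy_step(instruction, step))
            time.sleep(step_time)  # simulated time spent executing the action
    return log


if __name__ == "__main__":
    episode = run_episode("open the drawer", scene=None)
    print(episode[0][1])
    print(episode[-1][1])
```

In this sketch the control loop never waits on the instructor: early steps use the task description, and later steps switch to the generated annotation once the background call completes.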