🤖 AI Summary
This work addresses the lack of control over instruction granularity in existing language-guided embodied AI benchmarks, which makes it hard to measure how granularity affects agent behavior. The authors introduce the Mini-BEHAVIOR-Gran benchmark, which provides multiple instruction variants per task, ranging from high-level goals to step-by-step guidance, and propose "planning width" as a quantitative metric for characterizing granularity across tasks. Through multi-granularity instruction generation, vision-language policy training, and evaluation, they find a non-monotonic, U-shaped relationship between instruction granularity and agent performance, with peaks at both the coarsest and the finest granularities. Their analysis suggests that the performance rebound under coarse instructions is associated with shallow grounding, in which agents learn vision-dominant policies. Among the four evaluated metrics (token count, entity count, action-verb count, and planning width), planning width correlates most consistently with agent performance.
📝 Abstract
Instruction granularity is an important yet poorly controlled variable in language-guided embodied AI. Existing benchmarks typically pair each task with a single static instruction, making it difficult to study how agent behavior changes when the same task is described at different levels of detail. We introduce Mini-BEHAVIOR-Gran, a new benchmark for controlled studies of instruction granularity that extends Mini-BEHAVIOR with multiple instruction variants per task, ranging from high-level goal descriptions to step-by-step guidance. Using this benchmark, we compare four candidate metrics for cross-task granularity quantification: token count, entity count, action-verb count, and planning width. We find that planning width correlates most consistently with agent performance. Using planning width to organize training and evaluation further reveals a non-monotonic, U-shaped relationship between instruction granularity and performance, with peaks at both the fine and coarse extremes. Further analysis suggests that the performance rebound at coarse granularity is associated with shallow grounding, where agents learn vision-dominant policies.
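The metric comparison described in the abstract can be illustrated with a minimal sketch. It assumes Spearman rank correlation as the comparison statistic, which the abstract does not specify, and it takes per-instruction metric values as given rather than computing them (the abstract does not define how planning width is calculated). The function names `average_ranks`, `spearman`, and `rank_metrics` are hypothetical, and the numbers in the usage example are made up for illustration; they are not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def rank_metrics(metric_values, success_rates):
    """Sort candidate metrics by |Spearman correlation| with success rate."""
    scores = {
        name: spearman(vals, success_rates)
        for name, vals in metric_values.items()
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up toy values for four instruction variants (NOT data from the paper).
toy_metrics = {
    "token_count": [5, 12, 9, 20],
    "planning_width": [1, 2, 3, 4],
}
toy_success = [0.2, 0.4, 0.6, 0.8]

for name, rho in rank_metrics(toy_metrics, toy_success):
    print(f"{name}: rho = {rho:.2f}")
```

Rank correlation only captures monotonic trends, so a U-shaped relationship such as the one reported in the abstract would need a separate analysis, for example comparing mean performance across bins of the metric.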