🤖 AI Summary
Long-horizon robotic manipulation often lacks dense reward signals aligned with task procedures, so existing methods tend to mistake the mere passage of time for task advancement and struggle to detect stagnation or failure. To address this, this work proposes ProcVLM, a procedure-structure-guided vision-language model built on a reasoning-before-estimation paradigm: it first infers the remaining atomic actions and then estimates task progress. ProcVLM combines subtask-semantic annotations, a visual-change-driven progress allocation mechanism, and joint pretraining on progress estimation, action segmentation, and future planning. Trained on ProcCorpus-60M, a newly curated large-scale procedure-aware dataset, and on the ProcVQA data derived from it, ProcVLM outperforms baseline methods in progress estimation and reward modeling, yielding more discriminative dense reward signals that support downstream policy optimization.
📝 Abstract
Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/
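The progress supervision described above (a budget per subtask, spread within the subtask according to visual change) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `procedure_grounded_progress`, the equal budget per subtask, the use of mean absolute feature difference as the visual-change measure, and the linear fallback for static segments are all assumptions made for illustration.

```python
import numpy as np

def procedure_grounded_progress(features, subtask_ids):
    """Return per-frame progress in [0, 1] (illustrative sketch only).

    features:    length-T sequence of per-frame features, scalar or D-dim (hypothetical input)
    subtask_ids: length-T sequence of subtask labels forming contiguous segments
    """
    subtask_ids = np.asarray(subtask_ids)
    T = len(subtask_ids)
    features = np.asarray(features, dtype=float).reshape(T, -1)

    # Assumed visual-change measure: mean absolute difference to the previous frame.
    change = np.zeros(T)
    change[1:] = np.abs(np.diff(features, axis=0)).mean(axis=1)

    # Segment boundaries: a new subtask starts wherever the label changes.
    starts = [0] + [t for t in range(1, T) if subtask_ids[t] != subtask_ids[t - 1]]
    ends = starts[1:] + [T]

    # Assumption: every subtask receives an equal share of the total progress.
    budget = 1.0 / len(starts)

    progress = np.zeros(T)
    for k, (s, e) in enumerate(zip(starts, ends)):
        seg = change[s:e]
        total = seg.sum()
        if total > 0:
            # Spend the budget in proportion to cumulative visual change.
            frac = np.cumsum(seg) / total
        else:
            # Static segment: fall back to linear interpolation (assumption).
            frac = np.arange(1, e - s + 1) / (e - s)
        progress[s:e] = k * budget + budget * frac
    return progress

# Two subtasks of four frames each; the scene stalls before each visible change.
feats = [0, 0, 0, 1, 1, 2, 2, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(procedure_grounded_progress(feats, labels))
print(np.linspace(0.0, 1.0, len(feats)))  # time-based interpolation, for contrast
```

In this toy trajectory the sketch holds progress flat while nothing changes visually and advances it only when the scene changes, whereas the time-based interpolation rises steadily through the stalled frames. That is the conflation of elapsed time with task progress that the abstract argues against.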