ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an end-to-end trainable, structure-aware method for procedural planning in instructional videos. It targets the limited sample efficiency and high computational cost of large-scale planners that learn procedural structure only implicitly. The key component is a Differentiable Viterbi Layer that embeds a Procedural Knowledge Graph into Viterbi decoding and replaces non-differentiable operations with smooth relaxations, so that action sequences are decoded under graph-structured priors during training. Evaluated on the CrossTask, COIN, and NIV datasets, the approach achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners, and it improves sample efficiency and robustness to shorter, unseen planning horizons.

📝 Abstract
Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample efficiency and high computational cost. In this work, we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly into the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that the performance gains arise from our differentiable, structure-aware training rather than from post-hoc refinement, resulting in improved sample efficiency and robustness to shorter, unseen horizons. We also address testing inconsistencies by establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
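
To make the idea of "replacing non-differentiable operations with smooth relaxations" concrete, here is a minimal sketch of one common relaxation: swapping the hard max in the Viterbi recursion for a temperature-scaled log-sum-exp. This is an illustration under stated assumptions, not the paper's DVL. The function names (`smooth_max`, `viterbi_score`), the toy 3-action graph, and the emission scores are all hypothetical, and the abstract does not say which relaxation the authors actually use. Graph edges stand in for the PKG: a transition absent from the graph gets log-score `-inf`, so no path can use it.

```python
import math

NEG_INF = float("-inf")

def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp, a smooth upper bound on max(values)."""
    m = max(values)
    if m == NEG_INF:  # every candidate is forbidden by the graph
        return NEG_INF
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)

def viterbi_score(emissions, log_trans, temperature=None):
    """Best-path score over all steps.

    temperature=None uses the hard max (classic Viterbi);
    otherwise the max is replaced by smooth_max.
    """
    if temperature is None:
        reduce = max
    else:
        reduce = lambda vs: smooth_max(vs, temperature)
    n = len(log_trans)
    delta = list(emissions[0])  # scores of length-1 paths ending in each action
    for t in range(1, len(emissions)):
        delta = [
            reduce([delta[i] + log_trans[i][j] for i in range(n)]) + emissions[t][j]
            for j in range(n)
        ]
    return reduce(delta)

# Hypothetical graph over 3 actions with edges 0->1, 0->2, 1->2, 2->1.
edges = {(0, 1), (0, 2), (1, 2), (2, 1)}
log_trans = [[0.0 if (i, j) in edges else NEG_INF for j in range(3)] for i in range(3)]

# Hypothetical per-step action scores (rows = time steps, columns = actions).
emissions = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 1.4],
    [0.2, 0.4, 1.0],
]

hard = viterbi_score(emissions, log_trans)            # best path 0 -> 1 -> 2 scores 4.5
soft_warm = viterbi_score(emissions, log_trans, 1.0)  # smooth, strictly above hard here
soft_cold = viterbi_score(emissions, log_trans, 0.01) # approaches hard as temperature -> 0
```

In this sketch the smooth score is built only from additions, exponentials, and logarithms, so in an autodiff framework gradients would flow back to every emission score that lies on a graph-feasible path, while forbidden transitions contribute nothing. Lowering the temperature trades gradient smoothness for closeness to the hard Viterbi score.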
Problem

Research questions and friction points this paper is trying to address.

procedural planning
instructional videos
sample efficiency
computational cost
action sequence prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable Viterbi
Procedural Knowledge Graph
End-to-end Planning
Sample Efficiency
Instructional Video Understanding