Long-Horizon Visual Imitation Learning via Plan and Code Reflection

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the dual challenges of temporal action modeling and spatial object relationship understanding in long-horizon visual imitation, this paper proposes a dual-reflection agent framework integrating high-level plan generation with low-level executable code generation. The framework introduces the first “plan–code” co-verification mechanism, jointly enforcing semantic and structural consistency to iteratively optimize temporal coherence and spatial alignment, while enabling error detection and self-correction. It generates both abstract plans and concrete, executable code end-to-end from video demonstrations, leveraging code as a precise, deterministic policy representation to enhance behavioral reliability. To advance the field, we introduce LongVILBench—the first long-sequence visual imitation benchmark—comprising 300 complex human demonstrations. Experiments reveal severe performance degradation of existing methods on this benchmark, whereas our framework establishes new strong baselines across diverse, challenging tasks.

📝 Abstract
Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.
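The abstract's pipeline — plan generation, plan reflection, code generation, code reflection, with regeneration on failure — can be sketched as a simple agent loop. This is an illustrative sketch only: the class name, method signatures, and the stub verifier logic below are assumptions, not the authors' implementation (the paper's modules are model-based, not rule-based).

```python
from dataclasses import dataclass, field

@dataclass
class ReflectionAgent:
    """Hypothetical sketch of the dual-reflection loop described in the abstract."""
    max_rounds: int = 3
    log: list = field(default_factory=list)

    def generate_plan(self, demo):
        # Stand-in for the plan generation module (a vision-language model
        # in the paper); here we simply read the demonstrated actions.
        return list(demo["actions"])

    def reflect_on_plan(self, plan, demo):
        # Plan reflection: verify temporal coherence and spatial alignment
        # with the demonstration video. This stub only checks step count.
        ok = len(plan) == len(demo["actions"])
        self.log.append(("plan", ok))
        return ok, plan

    def generate_code(self, plan):
        # Code generation: translate each plan step into an executable call.
        return [f"robot.execute({step!r})" for step in plan]

    def reflect_on_code(self, code, plan):
        # Code reflection: verify the code is consistent with the plan
        # (one executable call per plan step).
        ok = len(code) == len(plan) and all("execute" in line for line in code)
        self.log.append(("code", ok))
        return ok, code

    def run(self, demo):
        for _ in range(self.max_rounds):
            plan = self.generate_plan(demo)
            plan_ok, plan = self.reflect_on_plan(plan, demo)
            if not plan_ok:
                continue  # plan rejected: regenerate on the next round
            code = self.generate_code(plan)
            code_ok, code = self.reflect_on_code(code, plan)
            if code_ok:
                return code
        return None  # give up after max_rounds failed rounds

demo = {"actions": ["pick(red_block)", "place(red_block, tray)"]}
agent = ReflectionAgent()
program = agent.run(demo)
```

The point of the structure is that both reflection modules gate their respective generators, so errors are caught at the stage where they arise rather than surfacing only at execution time.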
Problem

Research questions and friction points this paper is trying to address.

Addresses long-horizon visual imitation learning challenges
Handles complex temporal and spatial action relationships
Improves plan and code generation via reflection modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plan reflection module ensures temporal and spatial alignment
Code reflection module verifies and refines executable code
Dual reflection modules jointly detect and correct errors
Authors
Quan Chen — Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology
Chenrui Shi — Beijing Institute of Technology
Qi Chen — Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology
Yuwei Wu — GRASP Lab, University of Pennsylvania
Zhi Gao — Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology
Xintong Zhang — Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology
Rui Gao — Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
Kun Wu — Beijing Innovation Center of Humanoid Robotics
Yunde Jia — Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University