CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

📅 2026-04-24

🤖 AI Summary

This work addresses the failure of vision-language-action (VLA) models on non-Markovian long-horizon tasks, where relying solely on the most recent observation degrades performance because objects become occluded or critical evidence appears only early in the trajectory. To overcome this limitation, the authors propose a hierarchical framework that maintains task-relevant entities and relations in a persistent semantic graph and runs an executable code-based planner over it, giving efficient progress tracking and state construction that is robust to disturbances. They also introduce progress-guided visual-language prompting, which produces subtask instructions focused on key objects to guide VLA execution. The approach is presented as the first to fuse executable code with semantic-graph state; on real-world non-Markovian tasks it improves task completion over strong VLA baselines and history-augmented VLA variants while substantially reducing planning latency relative to VLM-in-the-loop planning.
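To make the idea of a persistent semantic graph concrete, here is a minimal Python sketch of state that survives occlusion. It is an illustration under stated assumptions, not the authors' implementation: the `SemanticGraph` class, its `update` method, the triple format, and the object names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Hypothetical persistent state: entities plus (subject, relation, object) triples."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def update(self, visible_entities, observed_relations):
        """Merge one frame's detections without forgetting occluded entities."""
        visible = set(visible_entities)
        observed = set(observed_relations)
        self.entities |= visible
        # A stored relation is dropped only when both endpoints are visible in
        # this frame and the relation is no longer observed. Anything involving
        # a hidden entity is carried over from earlier frames.
        stale = {
            (s, r, o) for (s, r, o) in self.relations
            if s in visible and o in visible and (s, r, o) not in observed
        }
        self.relations = (self.relations - stale) | observed


graph = SemanticGraph()
# Frame 1: the cube is seen being placed inside the cup.
graph.update({"cube", "cup", "tray"}, {("cube", "in", "cup")})
# Frame 2: the cube is hidden by the cup wall, so it is not detected.
graph.update({"cup", "tray"}, {("cup", "on", "tray")})

cube_still_tracked = ("cube", "in", "cup") in graph.relations
```

A policy that sees only the second frame has no evidence that the cube exists, whereas the graph still records where it went. That retained evidence is what the latest observation alone cannot provide in a non-Markovian task.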

📝 Abstract

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
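The planning loop described in the abstract can be sketched in a few lines of Python. This is a hedged illustration only: the task, the `plan_step` and `suppress_clutter` functions, the relation triples, and the bounding boxes are assumptions made for the example, and filtering detections stands in for whatever image-level clutter suppression the paper actually uses.

```python
def plan_step(relations):
    """Hypothetical synthesized planner for 'put the cube in the cup, then the cup on the tray'.

    Returns (subtask_instruction, relevant_objects); (None, []) means the task is done.
    """
    # Each step: (progress check over the graph, instruction, objects to keep in view).
    steps = [
        (("cube", "in", "cup"), "place the cube in the cup", ["cube", "cup"]),
        (("cup", "on", "tray"), "place the cup on the tray", ["cup", "tray"]),
    ]
    for goal, instruction, objects in steps:
        if goal not in relations:  # cheap set lookup, no model call per step
            return instruction, objects
    return None, []


def suppress_clutter(detections, relevant_objects):
    """Keep only the boxes of subtask-relevant objects (stand-in for image masking)."""
    return {name: box for name, box in detections.items() if name in relevant_objects}


# Graph state after the first subtask has been completed.
relations = {("cube", "in", "cup")}
# Hypothetical detections (x1, y1, x2, y2); "mug" is a distractor.
detections = {
    "cup": (40, 60, 90, 120),
    "tray": (10, 130, 200, 180),
    "mug": (150, 20, 190, 70),
}

instruction, objects = plan_step(relations)
focused = suppress_clutter(detections, objects)
```

Because the progress check is ordinary code running over the graph, each planning step costs a few set lookups rather than a vision-language model query, which is one plausible reading of why the abstract reports lower planning latency than VLM-in-the-loop planning.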

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

non-Markovian tasks

long-horizon manipulation

partial observability

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-graph state

code-based planner

non-Markovian VLA