Debugging code world models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses key challenges in code world models (CWMs) when simulating program execution, including ambiguous error sources, weak long-range state tracking, and inadequate modeling of string states. Through systematic analysis of failure mechanisms from both local semantic execution and long-range state propagation perspectives, the work identifies two core issues: token exhaustion caused by dense state outputs and limitations in string state representation due to subword tokenization. Using real-code execution benchmarks and controlled permutation-tracking tasks with explicit state prediction, the research demonstrates that CWMs can effectively propagate states over long horizons when provided with ground-truth action sequences, indicating that errors primarily stem from action generation rather than state propagation itself. These findings offer new directions for designing more efficient supervision signals and state representations aligned with data types.
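The "dense state outputs" failure mode is easy to see in a toy trace. The sketch below (illustrative only; the paper's actual trace format is not shown here) pairs each executed command with a full snapshot of the runtime state, so trace size grows with both program length and state size — the source of token-budget exhaustion on long executions.

```python
def trace_execution(commands):
    """Execute simple assignment commands, emitting (command, state) pairs.

    Each step snapshots the *entire* runtime state, mimicking a CWM-style
    dense trace; note how the per-step output grows as state accumulates.
    """
    state, trace = {}, []
    for cmd in commands:
        exec(cmd, {}, state)              # run one command against the state
        trace.append((cmd, dict(state)))  # dense snapshot of the full state
    return trace

program = ["x = 1", "y = x + 2", "s = 'ab' * y"]
for cmd, state in trace_execution(program):
    print(f"{cmd} -> {state}")
```

A sparse alternative (emitting only the variables changed by each command) is one way to read the paper's call for "more efficient supervision signals."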

📝 Abstract
Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, revealing dense runtime state produces token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long-horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.
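The permutation-tracking setup described in the abstract can be sketched as follows. This is a hedged reconstruction (the paper's exact task format, action vocabulary, and horizon lengths are assumptions): the state is a permutation of `n` items, each action swaps two positions, and given ground-truth actions the state trajectory is a deterministic fold — which is why propagation alone can remain accurate over long horizons.

```python
import random

def rollout(n, horizon, seed=0):
    """Generate a swap-action sequence and the ground-truth state after each step.

    Returns a list of ((i, j), state) pairs: the action taken and the
    resulting permutation. A CWM evaluated with ground-truth actions only
    has to reproduce this fold; evaluated end-to-end it must also generate
    the (i, j) actions itself, where the paper locates most of the error.
    """
    rng = random.Random(seed)
    state = list(range(n))
    steps = []
    for _ in range(horizon):
        i, j = rng.sample(range(n), 2)          # action: swap positions i, j
        state[i], state[j] = state[j], state[i]
        steps.append(((i, j), tuple(state)))    # (action, resulting state)
    return steps
```

Scoring a model is then a per-step exact match between its predicted state and the ground-truth state in `steps`.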
Problem

Research questions and friction points this paper is trying to address.

Code World Models
execution simulation
state tracking
tokenization limitations
long-horizon degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code World Models
runtime state tracking
subword tokenization
long-horizon reasoning
execution-based verification