Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

πŸ“… 2026-03-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a key limitation in post-training large language models (LLMs) with reinforcement learning: the prevailing β€œhistory-as-state” paradigm restricts the discovery of capabilities not covered during pretraining. To overcome this, the study systematically introduces the explicit Markovian state mechanism from classical reinforcement learning into LLM post-training, replacing the full historical sequence with a compact, estimated state representation. This approach alleviates existing bottlenecks in exploration and reasoning. Theoretical analysis demonstrates that the proposed framework achieves lower sample complexity, while empirical results show significant improvements over standard post-training methods across multiple complex logical reasoning tasks, exhibiting enhanced generalization and a greater capacity for discovering novel strategies.
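The core contrast — an ever-growing action history versus a compact, estimated Markov state — can be sketched in a toy setting. This is an illustration only; the function names and the parity task are assumptions for exposition, not code or tasks from the paper:

```python
# Hedged sketch (not the paper's implementation): contrasting the two
# state formulations on a toy token-level RL setup.

def history_as_state(tokens):
    # "History-as-state": the policy conditions on the entire action
    # history, so the state grows linearly with the number of steps.
    return tuple(tokens)

def estimated_markov_state(tokens):
    # Compact estimated state (illustrative choice): for a parity-style
    # reasoning task, the count of a distinguished token modulo 2 is a
    # sufficient statistic, so the state stays O(1) however long the
    # trajectory gets.
    return tokens.count("a") % 2

history = ["a", "b", "a", "c"]
assert len(history_as_state(history)) == 4   # grows with the trajectory
assert estimated_markov_state(history) == 0  # fixed-size statistic
```

The design point the paper argues for is visible even here: a policy over the compact state explores a far smaller space than one over all histories, which is where the sample-complexity gains come from.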

πŸ“ Abstract
Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
Problem

Research questions and friction points this paper is trying to address.

capability ceiling
Markov states
LLM post-training
reinforcement learning
history-as-state
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markov states
reinforcement learning
LLM post-training
sample complexity
reasoning capabilities
πŸ”Ž Similar Papers