🤖 AI Summary
This work investigates whether next-token prediction in large language models (LLMs) induces a trade-off between computational efficiency and generalization on shortest-path reasoning tasks. Method: The authors construct a hierarchical graph dataset comprising question-trajectory-answer triples and train decoder-only Transformers with a custom tokenizer designed to encode structured reasoning steps. Crucially, they train models on semantically coherent yet suboptimal, backtracking-rich long trajectories, rather than on shortest paths computed by dynamic programming. Contribution/Results: Contrary to intuition, models trained on such “inefficient” trajectories achieve significantly better zero-shot generalization to unseen graph topologies than those trained on optimal paths. The study reveals that next-token prediction inherently favors high-probability, high-consistency trajectories; although computationally redundant, these trajectories provide a stronger implicit learning signal that enhances generalization, and the benefit reflects this richer signal rather than mere path compression. This challenges the assumption that optimal reasoning traces are universally superior for training LLMs on structured reasoning tasks.
📝 Abstract
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
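To make the contrast between the two trace types concrete, here is a minimal sketch of the task setup. All names and the graph encoding are assumptions for illustration, not the authors' dataset code: a layered graph where edges only connect consecutive layers, an optimal trace produced by bottom-up dynamic programming, and a longer but still valid trace produced by depth-first search that records its backtracking steps.

```python
def shortest_path_dp(layers, edges):
    """Bottom-up dynamic programming over layers.
    edges[(u, v)] gives the weight of the edge u -> v (consecutive layers only).
    Returns (cost, path) from the single source in layers[0]
    to the cheapest node in the final layer."""
    cost = {layers[0][0]: 0}
    parent = {}
    for i in range(len(layers) - 1):
        for u in layers[i]:
            if u not in cost:
                continue
            for v in layers[i + 1]:
                w = edges.get((u, v))
                if w is None:
                    continue
                if v not in cost or cost[u] + w < cost[v]:
                    cost[v] = cost[u] + w
                    parent[v] = u
    goal = min(layers[-1], key=lambda v: cost.get(v, float("inf")))
    path, node = [goal], goal
    while node in parent:
        node = parent[node]
        path.append(node)
    return cost[goal], path[::-1]

def backtracking_trace(layers, edges, target_cost):
    """Depth-first search that emits every visited node, including
    dead ends and returns to earlier nodes, yielding a longer but
    semantically coherent trace that ends at a path of target_cost."""
    trace = []
    def dfs(u, layer, acc):
        trace.append(u)
        if layer == len(layers) - 1:
            return acc == target_cost
        for v in layers[layer + 1]:
            w = edges.get((u, v))
            if w is None:
                continue
            if dfs(v, layer + 1, acc + w):
                return True
            trace.append(u)  # backtrack: the trace revisits u
        return False
    dfs(layers[0][0], 0, 0)
    return trace

# Toy layered graph: s -> {a, b} -> t
layers = [["s"], ["a", "b"], ["t"]]
edges = {("s", "a"): 2, ("s", "b"): 1, ("a", "t"): 2, ("b", "t"): 1}
cost, optimal = shortest_path_dp(layers, edges)
long_trace = backtracking_trace(layers, edges, cost)
```

On this toy instance, the DP trace is the three-node optimal path, while the DFS trace first explores the dead-end branch through `a`, backtracks, and only then reaches `t` through `b`, so it is strictly longer yet every step is a locally valid move, mirroring the "inefficient but coherent" trajectories the paper trains on.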