🤖 AI Summary
This work identifies and characterizes the “Position Curse”: large language models that handle long-context needle-in-a-haystack retrieval with near-saturated accuracy still struggle to retrieve the last few items of a sequence, even a short one. To investigate the failure systematically, the authors evaluate bidirectional position retrieval, in which each position is given as a forward or backward offset from an anchor, and find that backward retrieval substantially lags forward retrieval across open-source and frontier closed-source models. They then construct PosBench, a position-focused training dataset. LoRA fine-tuning on PosBench improves both forward and backward retrieval, and the gains generalize to the held-out PyIndex code-understanding benchmark. Absolute performance nevertheless remains far from saturated, which the authors take as motivation for treating position-based retrieval as a target for future pretraining objectives and model design.
📝 Abstract
Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.
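The two query types described above can be made concrete with a small reference implementation that computes the ground-truth answer a model would be scored against. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `item_at` and `offset_of` are hypothetical, offsets are assumed to be 1-based, and list items are assumed to be unique. The abstract does not specify the prompt wording or the offset convention, so both are illustrative choices.

```python
from typing import Optional, Sequence


def item_at(items: Sequence[str], offset: int, backward: bool = False,
            anchor: Optional[str] = None) -> str:
    """Position -> item: return the item `offset` steps from the anchor.

    With anchor=None the anchor is an endpoint of the list: the start for
    forward queries, the end for backward queries (assumed convention).
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    if anchor is None:
        # Endpoint anchor: offset 1 is the first item (forward) or last item (backward).
        index = len(items) - offset if backward else offset - 1
    else:
        # Item anchor: step away from the anchor item's own index.
        base = items.index(anchor)
        index = base - offset if backward else base + offset
    if not 0 <= index < len(items):
        raise IndexError("offset falls outside the list")
    return items[index]


def offset_of(items: Sequence[str], target: str, backward: bool = False,
              anchor: Optional[str] = None) -> int:
    """Item -> position: return the offset of `target` from the anchor."""
    pos = items.index(target)
    if anchor is None:
        return len(items) - pos if backward else pos + 1
    base = items.index(anchor)
    offset = base - pos if backward else pos - base
    if offset < 1:
        raise ValueError("target is not on the requested side of the anchor")
    return offset


letters = ["q", "d", "x", "m", "t"]
second_to_last = item_at(letters, 2, backward=True)           # backward from the end
before_x = item_at(letters, 1, backward=True, anchor="x")     # backward from item "x"
m_from_end = offset_of(letters, "m", backward=True)           # inverse query
```

In this framing, the backward queries are the ones the abstract reports as lagging: they require counting from the end of the list or stepping leftward from an anchor item, rather than counting from the start.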