🤖 AI Summary
This study identifies fundamental capability boundaries of long-context language models (LCLMs) on multi-match and logical retrieval tasks, demonstrating that merely expanding context window size is insufficient. Method: We systematically construct a diverse retrieval benchmark, design a controllable-step reasoning mechanism, and establish a standardized evaluation framework to isolate the effects of reasoning steps and chain-of-thought (CoT) prompting. Contribution/Results: We formally characterize the “reasoning-step criticality” phenomenon—the first systematic identification of LCLMs’ retrieval capability thresholds—challenging the prevailing assumption that longer contexts inherently yield better performance. Empirical results show failure rates exceeding 90% under standard settings; with step-adapted CoT prompts, accuracy improves to over 95%, albeit at significantly increased computational cost. This work underscores the necessity of sufficient, controllable reasoning steps and task-specific CoT design—not just extended context—for effective retrieval in LCLMs.
📝 Abstract
Long-context language models (LCLMs), characterized by their extensive context windows, are becoming popular. However, despite being nearly perfect at standard long-context retrieval tasks, we find they are not equally capable across all types of retrieval tasks. Specifically, we identify two basic cases, "multi-matching retrieval" and "logic-based retrieval", which lie beyond LCLMs' capability boundary under normal settings. We further find that these cases can be well addressed with a specific number of reasoning steps, guided by specific CoT prompts, but doing so may cost too much time. We therefore propose a critical viewpoint: current LCLMs have no perfect solution for all types of retrieval tasks. Our work reveals novel properties of retrieval tasks and LCLMs, showing that long-context handling still has a long way to go.