🤖 AI Summary
This study investigates how the recursion depth of Recursive Language Models (RLMs) affects large language models' performance on long-context and complex reasoning tasks, focusing on accuracy degradation and surging computational cost. The RLM framework enables LLMs to handle near-infinite context lengths by offloading the prompt into an external REPL environment. The work presents empirical evidence of an "overthinking" phenomenon: while shallow recursion (depth 1) improves accuracy on complex tasks, deeper recursion (depth 2) degrades performance, drastically increases execution time (from 3.6 s to 344.5 s in one case), and exponentially inflates token consumption. Experiments with the DeepSeek v3.2 and Kimi K2 models systematically compare standard LLMs, RLMs (depth 1), and RLMs (depth 2) on the S-NIAH and OOLONG benchmarks.
📝 Abstract
This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework of Zhang et al. (2026), which enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated a pure LLM, an RLM (depth=1), and an RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink." While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2), or using RLMs on simple retrieval tasks, paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6 s to 344.5 s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
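As a rough illustration of the depth-limited recursion the abstract describes, the sketch below shows how an RLM-style call might split an offloaded context and recurse with a decremented depth budget. This is a toy sketch under stated assumptions, not the authors' implementation: `rlm_call` and `stub_model` are hypothetical names, and the real LLM is replaced by a trivial grep-like stand-in so the example runs on its own.

```python
# Toy sketch of depth-limited RLM-style recursion (illustrative only;
# function names and chunking strategy are assumptions, not the paper's API).

def stub_model(snippet: str, question: str):
    """Stand-in for an LLM call: answers only if the needle is in view."""
    for line in snippet.splitlines():
        if question in line:
            return line.split(":", 1)[1].strip()
    return None  # needle not visible in this snippet

def rlm_call(context: str, question: str, depth: int, chunk_lines: int = 10):
    """Query `context`; if depth budget remains, recurse over line chunks."""
    lines = context.splitlines()
    # Base case: no recursion budget, or context already small enough.
    if depth == 0 or len(lines) <= chunk_lines:
        return stub_model(context, question)
    # Recursive case: each chunk gets a sub-call with depth - 1,
    # mirroring how a depth-2 RLM spawns depth-1 sub-queries.
    for i in range(0, len(lines), chunk_lines):
        answer = rlm_call("\n".join(lines[i:i + chunk_lines]),
                          question, depth - 1, chunk_lines)
        if answer is not None:
            return answer
    return None

# Needle-in-a-haystack style toy context (S-NIAH in spirit).
haystack = "\n".join(f"line {i}: filler" for i in range(100))
haystack += "\nsecret_key: 42"
print(rlm_call(haystack, "secret_key", depth=1))  # -> 42
```

Note that each extra depth level multiplies the number of sub-calls, which is consistent with the steep execution-time and token-cost growth the study reports at depth 2.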