Expanding Computation Spaces of LLMs at Inference Time

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can expand their implicit computational capacity during inference via the artificial insertion of filler-token sequences. Method: a fine-tuning-free, inference-time technique that strategically inserts structured filler tokens (of a specific type and quantity) at critical positions, such as immediately before the final "Answer:" token, to augment the model's reasoning space. Contribution/Results: substantial performance gains are validated empirically on open-domain question answering and mathematical reasoning tasks, demonstrating for the first time that *prefix* filler injection can unlock latent reasoning capabilities, especially in smaller models, with diminishing returns as model scale increases. Ablation studies and attention visualizations across a 1.7B–32B model spectrum reveal structured computational dynamics within the expanded token space. Notably, SmolLM2-1.7B-Instruct improves by +12.37 percentage points, establishing filler-based prompting as a lightweight, parameter-free paradigm for inference-time reasoning enhancement.

📝 Abstract
Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final 'Answer:' token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.
Problem

Research questions and friction points this paper is trying to address.

Investigating artificial filler tokens as inference-time computational expansion
Determining optimal token types and placement for enhanced reasoning
Analyzing attention dynamics in expanded computational spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Insert filler tokens before final answer at inference
Use expanded computation spaces for problem-solving
Analyze attention dynamics in artificial token sequences
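The core intervention above is purely a prompt-level edit: locate the final 'Answer:' token and insert a run of filler tokens immediately before it. A minimal sketch of that insertion step is below; the filler string ('.'), the count, and the anchor text are illustrative placeholders, since the paper reports that the effective token type and count vary by model.

```python
def insert_filler_tokens(prompt: str, filler: str = ".", count: int = 10,
                         anchor: str = "Answer:") -> str:
    """Insert `count` filler tokens immediately before the last occurrence
    of `anchor`, expanding the model's computation space at inference time.

    The specific filler string and count here are hypothetical examples;
    the paper finds the best choices differ across model sizes.
    """
    idx = prompt.rfind(anchor)
    if idx == -1:
        # No anchor found: leave the prompt unchanged rather than guess.
        return prompt
    fillers = (filler + " ") * count
    return prompt[:idx] + fillers + prompt[idx:]


# Example: expand the space right before the final 'Answer:' token.
prompt = "Question: What is 17 * 6?\nAnswer:"
expanded = insert_filler_tokens(prompt, filler=".", count=5)
```

The expanded prompt would then be fed to the model as-is; no fine-tuning or parameter change is involved, which is what makes the method inference-time only.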