Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes ARACH, a training-free, plug-and-play inference-time intervention that dynamically optimizes the internal reasoning process of large language models without modifying their parameters. Unlike existing post-training optimization methods that primarily operate on inputs or outputs, ARACH adaptively constructs contextual anchors and redistributes global attention to mitigate the attention sink phenomenon. The approach incurs minimal computational overhead while enabling fine-grained control over internal model computations during inference. Experimental results demonstrate that ARACH consistently enhances performance across multiple language modeling benchmarks, significantly outperforming conventional prompt engineering and resampling strategies.
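The attention sink the summary refers to is the well-documented tendency of causal LMs to pile attention mass onto the first few tokens. It can be observed directly from a model's attention maps; the probe below is a minimal sketch using Hugging Face transformers with an assumed `gpt2` checkpoint (any small causal LM would do), not part of the ARACH method itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint choice; "eager" attention is requested so that
# attention weights are actually returned (SDPA kernels may skip them).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

text = "Large language models achieve remarkable performance on many tasks."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
# Skip query position 0, whose only visible key is token 0 by construction.
for i, attn in enumerate(out.attentions):
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {i:2d}: mean attention on first token = {sink_mass:.3f}")
```

If the sink is present, later layers typically show a disproportionate share of attention landing on token 0, which is the behavior ARACH's reallocation is designed to mitigate.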

📝 Abstract
Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, such methods rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
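The abstract does not spell out how the context hub is built, but the core idea, aggregate the context into an anchor and shift attention mass away from the sink token toward it, can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption: the mean-pooled hub key/value, the `sink_cap` clipping rule, and the function name are hypothetical stand-ins, not the authors' algorithm.

```python
import torch
import torch.nn.functional as F

def hub_reallocated_attention(q, k, v, sink_cap=0.2):
    """Toy single-head attention with an appended context hub.

    A hub key/value pair is formed by mean-pooling the context
    (an assumed aggregation rule). Attention exceeding `sink_cap`
    on the first (sink) token is clipped and the rows renormalized,
    so the surplus mass flows to the hub and the real context.
    q, k, v: (seq, dim) tensors for one head; causal masking is
    omitted for brevity.
    """
    hub_k = k.mean(dim=0, keepdim=True)
    hub_v = v.mean(dim=0, keepdim=True)
    k_aug = torch.cat([k, hub_k], dim=0)
    v_aug = torch.cat([v, hub_v], dim=0)

    scores = q @ k_aug.T / k.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)

    # Clip excess attention on the sink position and renormalize each row.
    attn[:, 0] = attn[:, 0].clamp(max=sink_cap)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v_aug

# Example with random activations for an 8-token, 64-dim context.
q, k, v = (torch.randn(8, 64) for _ in range(3))
print(hub_reallocated_attention(q, k, v).shape)  # torch.Size([8, 64])
```

Because the intervention only reshapes attention weights at inference time, no parameters change, which is consistent with the paper's claim of a training-free plug-in with modest overhead.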
Problem

Research questions and friction points this paper is trying to address.

training-free
inference-time
attention reallocation
large language models
internal computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
inference-time plug-in
attention reallocation
adaptive context hub
internal computation intervention
Jingtao Wang
Google
Machine Learning, Large Language Models, On-Device Foundation Models, Educational Technology
Yucong Wang
The College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China
Jun Ding
McGill University
Computational biology, Machine learning, Cell dynamics, Single-cell genomics, Virtual cells
Rui Cai
University of California, Davis
Machine Learning
Xun Wang
The College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China