Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes ARACH, a training-free, plug-and-play inference-time intervention that dynamically optimizes the internal reasoning process of large language models without modifying their parameters. Unlike existing post-training optimization methods that primarily operate on inputs or outputs, ARACH adaptively constructs contextual anchors and redistributes global attention to mitigate the attention sink phenomenon. The approach incurs minimal computational overhead while enabling fine-grained control over internal model computations during inference. Experimental results demonstrate that ARACH consistently enhances performance across multiple language modeling benchmarks, significantly outperforming conventional prompt engineering and resampling strategies.
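The attention sink the summary refers to is the well-documented tendency of causal LMs to pile attention mass onto the first few tokens. It can be observed directly from a model's attention maps; the probe below is a minimal sketch using Hugging Face transformers with an assumed `gpt2` checkpoint (any small causal LM would do), not part of the ARACH method itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint choice; "eager" attention is requested so that
# attention weights are actually returned (SDPA kernels may skip them).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

text = "Large language models achieve remarkable performance on many tasks."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
# Skip query position 0, whose only visible key is token 0 by construction.
for i, attn in enumerate(out.attentions):
    sink_mass = attn[0, :, 1:, 0].mean().item()
    print(f"layer {i:2d}: mean attention on first token = {sink_mass:.3f}")
```

If the sink is present, later layers typically show a disproportionate share of attention landing on token 0, which is the behavior ARACH's reallocation is designed to mitigate.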

📝 Abstract
Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, such methods rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
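The abstract does not spell out how the context hub is built, but the core idea, aggregate the context into an anchor and shift attention mass away from the sink token toward it, can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption: the mean-pooled hub key/value, the `sink_cap` clipping rule, and the function name are hypothetical stand-ins, not the authors' algorithm.

```python
import torch
import torch.nn.functional as F

def hub_reallocated_attention(q, k, v, sink_cap=0.2):
    """Toy single-head attention with an appended context hub.

    A hub key/value pair is formed by mean-pooling the context
    (an assumed aggregation rule). Attention exceeding `sink_cap`
    on the first (sink) token is clipped and the rows renormalized,
    so the surplus mass flows to the hub and the real context.
    q, k, v: (seq, dim) tensors for one head; causal masking is
    omitted for brevity.
    """
    hub_k = k.mean(dim=0, keepdim=True)
    hub_v = v.mean(dim=0, keepdim=True)
    k_aug = torch.cat([k, hub_k], dim=0)
    v_aug = torch.cat([v, hub_v], dim=0)

    scores = q @ k_aug.T / k.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)

    # Clip excess attention on the sink position and renormalize each row.
    attn[:, 0] = attn[:, 0].clamp(max=sink_cap)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v_aug

# Example with random activations for an 8-token, 64-dim context.
q, k, v = (torch.randn(8, 64) for _ in range(3))
print(hub_reallocated_attention(q, k, v).shape)  # torch.Size([8, 64])
```

Because the intervention only reshapes attention weights at inference time, no parameters change, which is consistent with the paper's claim of a training-free plug-in with modest overhead.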
Problem

Research questions and friction points this paper is trying to address.

training-free
inference-time
attention reallocation
large language models
internal computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
inference-time plug-in
attention reallocation
adaptive context hub
internal computation intervention
Jingtao Wang
Google
Machine Learning, Large Language Models, On-Device Foundation Models, Educational Technology
Yucong Wang
The College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China
Jun Ding
McGill University
Computational biology, Machine learning, Cell dynamics, Single-cell genomics, Virtual cells
Rui Cai
University of California, Davis
Machine Learning
Xun Wang
The College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China