🤖 AI Summary
This work addresses the performance degradation of large language models in long-context reasoning, which stems from attention dilution and reduced out-of-distribution generalization. Existing approaches typically rely on fixed context budgets, failing to accommodate the varying contextual demands of individual tokens. To overcome this limitation, the paper proposes UT-ACA, an adaptive context allocation framework that dynamically allocates context resources based on per-token uncertainty. UT-ACA combines semantic embeddings with logit confidence to estimate uncertainty in real time and models its cumulative effect across decoding steps. When evidence is insufficient, the framework triggers context rollback, window expansion, and token regeneration. This enables on-demand context utilization, significantly reducing average context consumption while preserving generation quality and improving inference efficiency.
📝 Abstract
Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
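The uncertainty-triggered control loop described in the abstract (score each decoded token, and on low confidence roll back, widen the context window, and regenerate) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the entropy-based uncertainty score, the `decode_step` stand-in for the model, and the threshold and window constants are all assumptions made for the example.

```python
# Hedged sketch of an uncertainty-triggered adaptive-context decoding loop
# in the spirit of UT-ACA. All names and constants are illustrative.
import math
from typing import List, Tuple

THRESHOLD = 0.5   # uncertainty above this triggers rollback + expansion (assumed)
MAX_CONTEXT = 8   # hard cap on the context window (assumed)

def uncertainty(probs: List[float]) -> float:
    """Logit-based confidence proxy: normalized entropy of the
    next-token distribution, in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def decode_step(context_size: int) -> Tuple[str, List[float]]:
    """Stand-in for the model: a larger context yields a sharper
    (more confident) next-token distribution."""
    peak = min(0.95, 0.5 + 0.06 * context_size)
    rest = (1.0 - peak) / 3.0
    return "tok", [peak, rest, rest, rest]

def generate(num_tokens: int, base_context: int = 2) -> List[Tuple[str, int]]:
    """Decode tokens with a small base context; when the uncertainty
    detector fires, roll back, expand the window, and regenerate."""
    out = []
    for _ in range(num_tokens):
        ctx = base_context
        token, probs = decode_step(ctx)
        while uncertainty(probs) > THRESHOLD and ctx < MAX_CONTEXT:
            ctx += 2                          # expand the context window
            token, probs = decode_step(ctx)   # regenerate with more support
        out.append((token, ctx))
    return out
```

Most tokens never trigger the fallback path in a real system, so the average context stays near the small base budget; only high-uncertainty tokens pay for a wider window, which is the source of the claimed savings.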