🤖 AI Summary
This paper addresses the loss of structural information in pre-softmax query-key logits caused by softmax normalization in autoregressive Transformers. To tackle this, the authors propose RCStat, a token-importance scoring framework based on Relative Contextualization (RC). RCStat introduces RC as a random variable and derives an efficiently computable upper bound for it, enabling adaptive key-value cache compression and high-fidelity attribution explanations without model retraining. By directly modeling raw query-key logits, rather than post-softmax attention weights, and pairing this pre-softmax signal with a dynamic thresholding mechanism, RCStat preserves fine-grained structural dependencies. Evaluated on question answering, summarization, and attribution tasks, RCStat achieves state-of-the-art performance, significantly reducing cache size while maintaining near-lossless inference quality.
📝 Abstract
Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
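The general idea of scoring tokens from raw pre-softmax query-key logits and evicting low-scoring KV-cache entries can be sketched as follows. This is a minimal, hypothetical illustration of logit-based importance with top-k retention; the scoring and eviction functions are illustrative stand-ins, not the paper's actual RC statistic, its upper bound, or its thresholding rule (the causal mask is also omitted for brevity):

```python
import numpy as np

def qk_logits(Q, K, d):
    # Raw pre-softmax query-key logits: S = Q K^T / sqrt(d).
    # These are the quantities RC-style methods analyze directly,
    # instead of the softmax-normalized attention weights.
    return Q @ K.T / np.sqrt(d)

def token_importance_presoftmax(S):
    # Hypothetical importance score: aggregate each key's raw logits
    # over all queries (column-wise mean). A real RC score would use
    # the paper's relative-contextualization statistic instead.
    return S.mean(axis=0)

def evict_keys(scores, keep_ratio=0.5):
    # Keep the top keep_ratio fraction of keys by score; evict the rest.
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
d = 16
Q = rng.standard_normal((8, d))  # 8 query tokens
K = rng.standard_normal((8, d))  # 8 cached key tokens
S = qk_logits(Q, K, d)
kept = evict_keys(token_importance_presoftmax(S), keep_ratio=0.5)
print(kept)  # indices of retained KV-cache entries
```

The design choice this sketch highlights is that the scores are computed before softmax normalization, so differences in logit magnitude across keys are preserved rather than being squashed into a probability simplex per query.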