🤖 AI Summary
This paper addresses the loss of structural information in pre-softmax query-key logits caused by softmax normalization in autoregressive Transformers. To tackle this, the authors propose RCStat, a token-importance scoring framework based on Relative Contextualization (RC). RCStat introduces RC as a random variable and derives an efficiently computable upper bound for it, enabling adaptive key-value cache compression and high-fidelity attribution explanations without model retraining. By directly modeling raw query-key logits, rather than post-softmax attention weights, and pairing this pre-softmax signal with a dynamic thresholding mechanism, RCStat preserves fine-grained structural dependencies. Evaluated on question answering, summarization, and attribution tasks, RCStat achieves state-of-the-art performance, significantly reducing cache size while maintaining near-lossless inference quality.
📝 Abstract
Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
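The general idea of scoring tokens from raw pre-softmax query-key logits and evicting low-scoring KV-cache entries can be sketched as follows. This is a minimal, hypothetical illustration of logit-based importance with top-k retention; the scoring and eviction functions are illustrative stand-ins, not the paper's actual RC statistic, its upper bound, or its thresholding rule (the causal mask is also omitted for brevity):

```python
import numpy as np

def qk_logits(Q, K, d):
    # Raw pre-softmax query-key logits: S = Q K^T / sqrt(d).
    # These are the quantities RC-style methods analyze directly,
    # instead of the softmax-normalized attention weights.
    return Q @ K.T / np.sqrt(d)

def token_importance_presoftmax(S):
    # Hypothetical importance score: aggregate each key's raw logits
    # over all queries (column-wise mean). A real RC score would use
    # the paper's relative-contextualization statistic instead.
    return S.mean(axis=0)

def evict_keys(scores, keep_ratio=0.5):
    # Keep the top keep_ratio fraction of keys by score; evict the rest.
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
d = 16
Q = rng.standard_normal((8, d))  # 8 query tokens
K = rng.standard_normal((8, d))  # 8 cached key tokens
S = qk_logits(Q, K, d)
kept = evict_keys(token_importance_presoftmax(S), keep_ratio=0.5)
print(kept)  # indices of retained KV-cache entries
```

The design choice this sketch highlights is that the scores are computed before softmax normalization, so differences in logit magnitude across keys are preserved rather than being squashed into a probability simplex per query.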