VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling

πŸ“… 2025-08-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of jointly optimizing accuracy, efficiency, and context awareness when modeling ultra-long user behavior sequences, this paper proposes VQL, an end-to-end vector-quantized attention framework. VQL quantizes only the attention keys (not full embeddings), combining multi-scale grouped quantization with a context-injection mechanism that avoids codebook expansion, which enables cache-efficient, low-distortion compression. It further incorporates separable temporal kernels and static features to support offline index caching and sequence-length-independent inference. Evaluated on three large-scale datasets, VQL consistently outperforms strong baselines, achieving a favorable trade-off between recommendation accuracy (e.g., +0.5–1.2% AUC) and inference latency (30–50% reduction). To the authors' knowledge, VQL is the first method to simultaneously achieve high-fidelity representation, context-aware modeling, and low-latency inference for ultra-long sequences.

πŸ“ Abstract
In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to latency and memory constraints. Existing solutions fall into two categories: (1) top-k retrieval, which truncates the sequence and may discard most attention mass when L >> k; and (2) encoder-based compression, which preserves coverage but often over-compresses and fails to incorporate key context such as temporal gaps or target-aware signals. Neither class achieves a good balance of low-loss compression, context awareness, and efficiency. We propose VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling, with three innovations. (1) Key-only quantization: only attention keys are quantized, while values remain intact; we prove that softmax normalization yields an error bound independent of sequence length, and a codebook loss directly supervises quantization quality. This also enables L-free inference via offline caches. (2) Multi-scale quantization: attention heads are partitioned into groups, each with its own small codebook, which reduces quantization error while keeping cache size fixed. (3) Efficient context injection: static features (e.g., item category, modality) are directly integrated, and relative position is modeled via a separable temporal kernel. All context is injected without enlarging the codebook, so cached representations remain query-independent. Experiments on three large-scale datasets (KuaiRand-1K, KuaiRec, TMALL) show that VQL consistently outperforms strong baselines, achieving higher accuracy while reducing inference latency, establishing a new state of the art in balancing accuracy and efficiency for ultra-long sequence recommendation.
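The key-only quantization idea in the abstract can be sketched in a few lines: keys are snapped to their nearest codebook entries while values stay intact, so only a small codebook plus per-key indices need to be cached offline. This is an illustrative numpy sketch with assumed shapes and a random codebook, not the paper's implementation (which also trains the codebook with a quantization loss).

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_keys(K, codebook):
    """Key-only quantization: replace each key with its nearest code vector.

    K: (L, d) attention keys; codebook: (C, d). Values are left untouched,
    so only the small codebook and the per-key indices must be cached.
    """
    # Squared Euclidean distance from every key to every codebook entry.
    d2 = ((K[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (L, C)
    idx = d2.argmin(axis=1)          # (L,) query-independent, cacheable offline
    return codebook[idx], idx

def attention(q, K, V):
    """Standard softmax attention for a single query q of shape (d,)."""
    scores = K @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

L, d, C = 512, 16, 32                # sequence length, head dim, codebook size
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))
q = rng.normal(size=d)

Kq, idx = quantize_keys(K, rng.normal(size=(C, d)))
out_exact = attention(q, K, V)       # full-precision keys
out_vq = attention(q, Kq, V)         # quantized keys, values intact
```

Because softmax renormalizes the scores, perturbing the keys changes only the attention weights, which is the intuition behind the sequence-length-independent error bound claimed in the abstract.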
Problem

Research questions and friction points this paper is trying to address.

Modeling ultra-long user behavior sequences under latency constraints
Balancing compression loss and context awareness in recommendations
Achieving efficient inference without sacrificing sequence coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-only quantization with error bound
Multi-scale quantization with per-group codebooks
Efficient context injection without enlarging codebook
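The multi-scale grouped quantization listed above resembles product quantization: the key dimensions are split into groups, each quantized against its own small codebook, so the composite code space grows multiplicatively while the cache stays small. A minimal sketch under assumed shapes, with a same-budget flat codebook for comparison (random codebooks here stand in for the learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)

def grouped_quantize(K, codebooks):
    """Quantize each feature group of the keys with its own small codebook.

    K: (L, d) keys; codebooks: list of G arrays of shape (C, d // G).
    Each group uses only C codes, yet the composite space has C**G
    combinations, which lowers reconstruction error at fixed cache size.
    """
    G = len(codebooks)
    parts = np.split(K, G, axis=1)           # G slices of shape (L, d // G)
    quantized, indices = [], []
    for part, cb in zip(parts, codebooks):
        d2 = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        quantized.append(cb[idx])
        indices.append(idx)
    return np.concatenate(quantized, axis=1), np.stack(indices, axis=1)

L, d, G, C = 256, 16, 4, 8
K = rng.normal(size=(L, d))
codebooks = [rng.normal(size=(C, d // G)) for _ in range(G)]
Kq, codes = grouped_quantize(K, codebooks)

# Baseline: one flat codebook with the same number of vectors (G * C).
flat = rng.normal(size=(G * C, d))
d2 = ((K[:, None, :] - flat[None, :, :]) ** 2).sum(-1)
Kq_flat = flat[d2.argmin(axis=1)]

err_grouped = ((K - Kq) ** 2).mean()
err_flat = ((K - Kq_flat) ** 2).mean()
```

With the same vector budget, the grouped quantizer spends more effective bits per dimension (C**G composite codes versus G * C flat ones), which is why the paper can shrink per-group codebooks without growing the cache.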
πŸ”Ž Similar Papers
No similar papers found.