LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling

📅 2026-02-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to modeling ultra-long user behavior sequences: high I/O latency and the quadratic complexity of attention. It proposes LASER, a framework that achieves efficient end-to-end learning by co-optimizing system and algorithmic design. Its key innovations are SeqVault, a unified storage architecture that uses a hybrid DRAM-SSD index to substantially reduce I/O overhead, and Segmented Target Attention (STA), which combines sigmoid gating with a lightweight Global Stacked Target Attention (GSTA) module to compress sequences while preserving critical interest signals. In offline experiments, LASER outperforms existing state-of-the-art methods, and large-scale online A/B tests serving over 100 million daily active users show a 2.36% lift in ADVV and a 2.08% lift in revenue.

📝 Abstract

Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
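
The abstract describes STA only at a high level, so the following is a rough NumPy sketch of the general idea rather than the paper's implementation. It assumes fixed-length segments, a gate computed as the sigmoid of a scaled dot product between each item and the target, gated sum-pooling within each segment, and a single softmax target-attention layer over the segment summaries standing in for GSTA (which the abstract calls "stacked" but does not otherwise specify). The function name, arguments, and shapes are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segmented_target_attention(seq, target, segment_len):
    """Illustrative sketch (assumed design, not the paper's exact method).

    seq:         (L, d) behavior-sequence embeddings, L divisible by segment_len
    target:      (d,)   candidate-item embedding
    segment_len: items per segment

    Returns the (d,) pooled interest vector and the (S, d) segment summaries.
    """
    L, d = seq.shape
    assert L % segment_len == 0, "pad the sequence to a multiple of segment_len"
    segments = seq.reshape(L // segment_len, segment_len, d)  # (S, m, d)

    # Per-item sigmoid gate: each item is scored against the target on its own,
    # so irrelevant items can all be pushed toward 0 (unlike softmax, the
    # gates within a segment are not forced to sum to 1).
    gates = sigmoid(segments @ target / np.sqrt(d))            # (S, m)

    # Compress each segment into one vector by gated sum-pooling.
    summaries = (gates[..., None] * segments).sum(axis=1)      # (S, d)

    # Global target attention over the S summaries (single layer here).
    weights = softmax(summaries @ target / np.sqrt(d))         # (S,)
    return weights @ summaries, summaries


rng = np.random.default_rng(0)
seq = rng.normal(size=(1024, 16))   # 1024 behaviors, 16-dim embeddings
target = rng.normal(size=16)
interest, summaries = segmented_target_attention(seq, target, segment_len=64)
print(interest.shape, summaries.shape)
```

Under these assumptions, the global step attends over 16 segment summaries instead of 1024 raw items, which is the kind of sequence compression the abstract refers to.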

Problem

Research questions and friction points this paper is trying to address.

long sequence modeling

recommendation systems

latency bottleneck

attention mechanism

user behavior sequences

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmented Target Attention

SeqVault

Long Sequence Modeling