LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling

📅 2026-02-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets two obstacles to modeling ultra-long user behavior sequences: high I/O latency and the quadratic complexity of attention. It proposes LASER, a framework that achieves efficient end-to-end learning by co-optimizing system and algorithm design. Its key innovations are SeqVault, a unified storage architecture that uses a hybrid DRAM-SSD index to cut I/O overhead (per the abstract, 50% lower retrieval latency and 75% lower CPU usage), and Segmented Target Attention (STA), which combines sigmoid gating with a lightweight Global Stacked Target Attention (GSTA) module to compress sequences while preserving critical interest signals. In offline evaluations LASER outperforms existing state-of-the-art methods, and large-scale online A/B tests serving over 100 million daily active users show a 2.36% lift in ADVV and a 2.08% lift in revenue.

📝 Abstract
Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
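The abstract describes STA only at a high level, so the following is an illustrative sketch of the two-stage idea rather than the authors' implementation. It assumes that each behavior is an embedding vector, that the sequence is cut into fixed-length segments, that the gate is a sigmoid of a scaled dot product with the target, and that the global stage is a single softmax target-attention pass over the segment summaries (the paper's GSTA is stacked and may differ). All function names and the `segment_len` parameter are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable sigmoid for large positive or negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def segmented_target_attention(target, behaviors, segment_len):
    """Compress a behavior sequence with respect to one target item.

    Assumed stage 1: inside each segment, every item gets an independent
    sigmoid gate, so irrelevant items can all be pushed toward zero
    (softmax weights, by contrast, must sum to one).
    Assumed stage 2: softmax target attention over the segment summaries.
    """
    scale = math.sqrt(len(target))
    segments = [behaviors[i:i + segment_len]
                for i in range(0, len(behaviors), segment_len)]

    summaries = []
    for seg in segments:
        gates = [sigmoid(dot(target, item) / scale) for item in seg]
        summaries.append(weighted_sum(gates, seg))

    weights = softmax([dot(target, s) / scale for s in summaries])
    return weighted_sum(weights, summaries), summaries

# Toy usage: five 2-dimensional behaviors, segments of length 2.
target = [1.0, 0.0]
behaviors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]
interest, summaries = segmented_target_attention(target, behaviors, 2)
print(len(summaries), interest)
```

In this sketch the global stage attends over one summary per segment instead of over every raw item, which is the sense in which the sequence is compressed before cross-segment dependencies are modeled.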
Problem

Research questions and friction points this paper is trying to address.

long sequence modeling
recommendation systems
latency bottleneck
attention mechanism
user behavior sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmented Target Attention
SeqVault
Long Sequence Modeling
Hybrid DRAM-SSD Indexing
Target-Aware Attention
Tianhe Lin, Fudan University (NLP, Large Language Models)
Ziwei Xiong, Xiaohongshu Inc., Shanghai, China
Baoyuan Ou, Xiaohongshu Inc., Shanghai, China
Yingjie Qin, Xiaohongshu Inc., Shanghai, China
Lai Xu, Xiaohongshu Inc., Shanghai, China
Xiaocheng Zhong, Xiaohongshu Inc., Shanghai, China
Yao Hu, Zhejiang University (Machine Learning)
Zhiyong Wang, Xiaohongshu Inc., Shanghai, China
Tao Zhou, Xiaohongshu Inc., Shanghai, China
Yubin Xu, Xiaohongshu Inc., Shanghai, China
Di Wu, Xiaohongshu Inc., Shanghai, China