ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

📅 2026-01-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing efficient attention mechanisms struggle to balance coverage and computational efficiency in long-context modeling. This work proposes ROSA-Tuning, the first approach to integrate a ROSA (RWKV Online Suffix Automaton) with large language models. ROSA-Tuning constructs ROSA structures in parallel on the CPU to retrieve historically relevant token positions and injects the retrieved information into the model's internal states in a trainable manner, fusing it with windowed attention through learned weighting. The method incorporates binary discretization, counterfactual gradients, and an asynchronous CPU-GPU pipeline to enable end-to-end training. Evaluated on Qwen3-Base-1.7B, ROSA-Tuning substantially recovers the long-range modeling capability lost by windowed attention, achieving performance on LongBench and other benchmarks that approaches or matches full global attention, while keeping computational efficiency and GPU memory usage comparable to windowed attention.
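The retrieval idea behind ROSA is an online suffix automaton over the token history: the automaton is extended token by token in amortized constant time, and the longest suffix of the current context that has occurred before can then be located together with a historical end position. The paper's run-length variant and CPU-parallel construction are not described on this page, so the sketch below is a generic online suffix automaton (the standard construction); the names `SuffixAutomaton`, `extend`, and `longest_suffix_match` are illustrative, not the paper's API.

```python
class SuffixAutomaton:
    """Generic online suffix automaton over a token stream (sketch only;
    ROSA's run-length handling and CPU parallelism are not modeled)."""

    def __init__(self):
        self.link = [-1]      # suffix links; state 0 is the initial state
        self.length = [0]     # length of the longest string in each state
        self.next = [{}]      # outgoing transitions per state
        self.end_pos = [-1]   # one end position (in the history) per state
        self.last = 0         # state for the whole history read so far
        self.n = 0            # number of tokens consumed

    def extend(self, c):
        """Append token `c` to the history (standard online construction)."""
        cur = len(self.length)
        self.length.append(self.length[self.last] + 1)
        self.link.append(-1)
        self.next.append({})
        self.end_pos.append(self.n)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q by inserting a clone of the right length.
                clone = len(self.length)
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                self.next.append(dict(self.next[q]))
                self.end_pos.append(self.end_pos[q])
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur
        self.n += 1

    def longest_suffix_match(self, query):
        """Walk the automaton over `query`, following suffix links on
        mismatch; return (match_len, end_pos) for the longest substring
        of the history that ends at some position of `query`."""
        state, l, best = 0, 0, (0, -1)
        for c in query:
            while state != 0 and c not in self.next[state]:
                state = self.link[state]
                l = self.length[state]
            if c in self.next[state]:
                state = self.next[state][c]
                l += 1
            if l > best[0]:
                best = (l, self.end_pos[state])
        return best
```

Querying walks at most one transition per query token plus suffix-link fallbacks, which is what makes suffix-automaton retrieval cheap enough to run on the CPU alongside attention.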

๐Ÿ“ Abstract
Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically cover only a limited portion of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Alongside the standard attention mechanism, ROSA-Tuning runs a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module in parallel, which efficiently locates historical positions in long contexts that are relevant to the current query and injects the retrieved information into the model state in a trainable manner; range-restricted attention then handles the subsequent weighted fusion. To enable end-to-end training, we employ a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to, and in some cases matching, global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. Example code is available at https://github.com/zyaaa-ux/ROSA-Tuning.
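Injecting retrieved positions involves a hard, non-differentiable selection, which the abstract says is trained through with binary discretization and a counterfactual gradient algorithm. Neither algorithm is specified on this page; as a loosely related illustration of training through a hard 0/1 decision, here is a straight-through estimator (a different, standard technique, not the paper's method) for a binary gate, with explicit forward/backward functions. All names are hypothetical.

```python
def binary_gate_forward(logits):
    # Hard 0/1 decision, e.g. "inject the retrieved info at this position
    # or not". Non-differentiable: its derivative is zero almost everywhere.
    return [1.0 if v > 0.0 else 0.0 for v in logits]

def binary_gate_backward(grad_output, logits, clip=1.0):
    # Straight-through estimator: treat the gate as the identity in the
    # backward pass and copy the upstream gradient, but only where
    # |logit| <= clip, so logits far from the threshold are left alone.
    return [g if abs(v) <= clip else 0.0
            for g, v in zip(grad_output, logits)]
```

The clipping window is the usual way to keep a straight-through gate from drifting once a decision is saturated; the paper's counterfactual gradients may behave quite differently.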
Problem

Research questions and friction points this paper is trying to address.

long-context modeling
computational efficiency
large language models
attention mechanism
windowed attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROSA-Tuning
long-context modeling
suffix automaton
efficient attention
asynchronous CPU-GPU pipeline
Yunao Zheng
Beijing University of Posts and Telecommunications (BUPT), Beijing, China
Xiaojie Wang
Beijing University of Posts and Telecommunications
Natural Language Processing, Visual Language Grounding, Dialogue
Lei Ren
Li Auto
NLP, LLM, VLM
Wei Chen
Li Auto Inc., Beijing, China