ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

📅 2026-01-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing efficient attention mechanisms struggle to balance coverage and computational efficiency in long-context modeling. This work proposes ROSA-Tuning, the first approach to integrate a ROSA (RWKV Online Suffix Automaton) with large language models. ROSA-Tuning constructs ROSA structures in parallel on the CPU to retrieve historically relevant token positions and injects the retrieved information into the model's internal states in a trainable manner, fusing it with windowed attention through learned weighting. The method incorporates binary discretization, counterfactual gradients, and an asynchronous CPU-GPU pipeline to enable end-to-end training. Evaluated on Qwen3-Base-1.7B, ROSA-Tuning substantially recovers the long-range modeling capability lost by windowed attention, achieving performance on LongBench and other benchmarks that approaches or matches full global attention, while keeping computational efficiency and GPU memory usage comparable to windowed attention.
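The retrieval idea behind ROSA is an online suffix automaton over the token history: the automaton is extended token by token in amortized constant time, and the longest suffix of the current context that has occurred before can then be located together with a historical end position. The paper's run-length variant and CPU-parallel construction are not described on this page, so the sketch below is a generic online suffix automaton (the standard construction); the names `SuffixAutomaton`, `extend`, and `longest_suffix_match` are illustrative, not the paper's API.

```python
class SuffixAutomaton:
    """Generic online suffix automaton over a token stream (sketch only;
    ROSA's run-length handling and CPU parallelism are not modeled)."""

    def __init__(self):
        self.link = [-1]      # suffix links; state 0 is the initial state
        self.length = [0]     # length of the longest string in each state
        self.next = [{}]      # outgoing transitions per state
        self.end_pos = [-1]   # one end position (in the history) per state
        self.last = 0         # state for the whole history read so far
        self.n = 0            # number of tokens consumed

    def extend(self, c):
        """Append token `c` to the history (standard online construction)."""
        cur = len(self.length)
        self.length.append(self.length[self.last] + 1)
        self.link.append(-1)
        self.next.append({})
        self.end_pos.append(self.n)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q by inserting a clone of the right length.
                clone = len(self.length)
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                self.next.append(dict(self.next[q]))
                self.end_pos.append(self.end_pos[q])
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur
        self.n += 1

    def longest_suffix_match(self, query):
        """Walk the automaton over `query`, following suffix links on
        mismatch; return (match_len, end_pos) for the longest substring
        of the history that ends at some position of `query`."""
        state, l, best = 0, 0, (0, -1)
        for c in query:
            while state != 0 and c not in self.next[state]:
                state = self.link[state]
                l = self.length[state]
            if c in self.next[state]:
                state = self.next[state][c]
                l += 1
            if l > best[0]:
                best = (l, self.end_pos[state])
        return best
```

Querying walks at most one transition per query token plus suffix-link fallbacks, which is what makes suffix-automaton retrieval cheap enough to run on the CPU alongside attention.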

๐Ÿ“ Abstract
Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically cover only a limited portion of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Alongside the standard attention mechanism, ROSA-Tuning runs a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module in parallel, which efficiently locates historical positions in long contexts that are relevant to the current query and injects the retrieved information into the model state in a trainable manner; range-restricted attention then handles the subsequent weighted fusion. To enable end-to-end training, we employ a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to, and in some cases matching, global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. Example code is available at https://github.com/zyaaa-ux/ROSA-Tuning.
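Injecting retrieved positions involves a hard, non-differentiable selection, which the abstract says is trained through with binary discretization and a counterfactual gradient algorithm. Neither algorithm is specified on this page; as a loosely related illustration of training through a hard 0/1 decision, here is a straight-through estimator (a different, standard technique, not the paper's method) for a binary gate, with explicit forward/backward functions. All names are hypothetical.

```python
def binary_gate_forward(logits):
    # Hard 0/1 decision, e.g. "inject the retrieved info at this position
    # or not". Non-differentiable: its derivative is zero almost everywhere.
    return [1.0 if v > 0.0 else 0.0 for v in logits]

def binary_gate_backward(grad_output, logits, clip=1.0):
    # Straight-through estimator: treat the gate as the identity in the
    # backward pass and copy the upstream gradient, but only where
    # |logit| <= clip, so logits far from the threshold are left alone.
    return [g if abs(v) <= clip else 0.0
            for g, v in zip(grad_output, logits)]
```

The clipping window is the usual way to keep a straight-through gate from drifting once a decision is saturated; the paper's counterfactual gradients may behave quite differently.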
Problem

Research questions and friction points this paper is trying to address.

long-context modeling
computational efficiency
large language models
attention mechanism
windowed attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROSA-Tuning
long-context modeling
suffix automaton
efficient attention
asynchronous CPU-GPU pipeline
Yunao Zheng
Beijing University of Posts and Telecommunications (BUPT), Beijing, China
Xiaojie Wang
Beijing University of Posts and Telecommunications
Natural Language Processing, Visual Language Grounding, Dialogue
Lei Ren
Li Auto
NLP, LLM, VLM
Wei Chen
Li Auto Inc., Beijing, China