AI Summary
This work addresses the significant performance degradation of large language models when processing long numerical sequences, a limitation primarily caused by attention dispersion in the Softmax-based attention mechanism. To mitigate this issue, the authors propose SepSeq, a training-free framework that strategically inserts delimiter tokens into input sequences to recalibrate attention distributions. This approach enables models to focus on local segments while preserving global context, effectively leveraging delimiters as attention sinks, a mechanism newly uncovered in this study. The method operates in a plug-and-play manner, requiring no model retraining or fine-tuning. Evaluated across nine mainstream large language models, SepSeq achieves an average relative accuracy improvement of 35.6% and reduces total token consumption during inference by 16.4%, demonstrating its efficacy and efficiency in enhancing numerical reasoning capabilities.
Abstract
While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.
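The core intervention described above, inserting separator tokens so that a long numerical sequence is broken into short local segments, can be sketched as a simple preprocessing step. The function name, delimiter choice, and segment length below are illustrative assumptions for exposition, not the paper's actual configuration:

```python
# Minimal sketch of the delimiter-insertion idea behind SepSeq.
# The delimiter string and segment length here are assumed values
# for illustration, not the settings used in the paper.

def insert_separators(numbers, segment_len=4, sep=";"):
    """Insert a separator token after every `segment_len` numbers.

    Rewriting the prompt this way is training-free: the separators act
    as attention sinks so the model can focus on local segments while
    the full sequence remains in context.
    """
    tokens = []
    for i, n in enumerate(numbers, start=1):
        tokens.append(str(n))
        if i % segment_len == 0 and i < len(numbers):
            tokens.append(sep)
    return " ".join(tokens)

prompt = insert_separators([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
print(prompt)  # "3 1 4 1 ; 5 9 2 6 ; 5 3 5"
```

Because the transformation touches only the input text, it is plug-and-play: the same preprocessing can be applied in front of any of the evaluated LLMs without retraining or fine-tuning.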