FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

📅 2026-05-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses attention dilution in long-context supervised fine-tuning of large language models, where critical information dispersed across the context receives too little attention. The authors propose FocuSFT, the first approach to introduce bilevel optimization into this setting: an inner loop adapts a lightweight parametric memory to concentrate attention on key content, while an outer loop performs supervised fine-tuning on top of this sharpened representation. Both loops attend bidirectionally over context tokens to mitigate causal asymmetry, while response tokens keep the causal mask used during generation. The framework thus explicitly models and alleviates attention dilution during training. Experiments show that FocuSFT improves accuracy on BABILong by up to 14 percentage points, raises RULER's CWE aggregation score from 72.9% to 81.1% at 16K, and yields a 24% relative improvement in pass@1 on GPQA with agentic tool use. Attention analysis further shows a 529-fold reduction in attention sink mass and a threefold increase in context engagement during training.
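The masking scheme described in the summary (bidirectional over context tokens, causal over response tokens) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `build_attention_mask` and the layout "context tokens first, response tokens after" are hypothetical and are not taken from the FocuSFT code.

```python
def build_attention_mask(num_context: int, num_response: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumed layout (hypothetical): positions 0..num_context-1 are context
    tokens, followed by num_response response tokens.
    """
    total = num_context + num_response
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if i < num_context:
                # Context query: sees every context token (bidirectional),
                # but never any response token.
                row.append(j < num_context)
            else:
                # Response query: sees all context tokens plus response
                # tokens up to and including itself (causal).
                row.append(j <= i)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 3 context tokens and 2 response tokens.
    for row in build_attention_mask(3, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because the context block is fully connected, the first token is no longer the only position visible to every query, which is one plausible reading of how this layout reduces the causal asymmetry mentioned above.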
📝 Abstract
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
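To make the inner/outer structure concrete, here is a toy sketch of a bilevel loop on a scalar regression problem. It is not the authors' implementation: the model (`y = (slow + fast) * x`), the squared-error losses, the step sizes, and the first-order outer update (the fast weight is treated as a constant when differentiating) are all assumptions chosen for illustration. The abstract does not specify the inner objective or how gradients pass through the inner loop.

```python
def mse_grad(weight, pairs):
    """Gradient of mean((weight * x - y)^2) with respect to weight."""
    return sum(2.0 * (weight * x - y) * x for x, y in pairs) / len(pairs)


def mse(weight, pairs):
    """Mean squared error of the scalar model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in pairs) / len(pairs)


def inner_adapt(slow, context, steps=3, lr=0.1):
    """Inner loop: fit a fast weight on the context, starting from zero.

    The slow weight is held fixed; only the fast weight moves.
    """
    fast = 0.0
    for _ in range(steps):
        fast -= lr * mse_grad(slow + fast, context)
    return fast


def outer_step(slow, context, response, lr=0.05):
    """Outer loop: one update of the slow weight on the response loss.

    Assumption: first-order approximation, i.e. the adapted fast weight
    is treated as a constant, so no gradient flows through inner_adapt.
    """
    fast = inner_adapt(slow, context)
    return slow - lr * mse_grad(slow + fast, response)


def train(slow, context, response, outer_steps=20):
    """Run several outer steps, re-adapting the fast weight each time."""
    for _ in range(outer_steps):
        slow = outer_step(slow, context, response)
    return slow


if __name__ == "__main__":
    context = [(1.0, 2.0), (2.0, 4.0)]   # hypothetical "context" examples
    response = [(3.0, 6.0)]              # hypothetical "response" example
    slow = train(0.0, context, response)
    fast = inner_adapt(slow, context)
    print("slow weight:", slow)
    print("response loss without fast weight:", mse(slow, response))
    print("response loss with fast weight:", mse(slow + fast, response))
```

In this toy, the fast weight plays the role of the parametric memory: it is rebuilt from the context at every outer step and discarded afterwards, while only the slow weight persists across steps.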
Problem

Research questions and friction points this paper is trying to address.

long-context
attention dilution
supervised fine-tuning
attention sinks
bilevel optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

bilevel optimization
attention dilution
long-context fine-tuning
fast-weight memory
bidirectional attention