🤖 AI Summary
This work addresses a weak supervision signal for long-range dependencies in long-context training: under packed training with document masking, most target tokens have only a short effective context, so few predictions actually depend on distant evidence. To correct this token-level supervision imbalance, the authors propose EXACT, which they present as the first method to treat the problem as one of supervision allocation. EXACT assigns extra weight to targets with long effective contexts, using inverse-frequency weighting within the long tail of the effective-context distribution. Experiments on the Qwen and LLaMA model families show gains of up to 17.91 points on the RULER long-context benchmark, with improvements on NoLiMa as well, while short-context QA and general reasoning performance remain stable.
📝 Abstract
Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA continued-pretraining (CPT) configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated), and RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA and reasoning performance is preserved (+0.24 macro-averaged change across six benchmarks). A distance-resolved probe shows that the gains arise when the evidence is thousands of tokens away, while short-distance cases remain unchanged. The results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
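To make the supervision mismatch concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a token's effective context under document masking is its offset within its own document, and that "inverse frequency within the long tail" means up-weighting tail targets in proportion to how rare their effective-context bin is. The names `effective_context_lengths` and `inverse_frequency_weights` and the parameters `tail_start`, `bin_size`, and `boost` are hypothetical; the abstract does not specify them.

```python
from collections import Counter


def effective_context_lengths(doc_lengths):
    """Effective context of every token in a packed sequence.

    Assumption: with document masking, the token at offset i inside its
    document can attend only to the i tokens before it in that document,
    no matter how long the packed sequence is.
    """
    lengths = []
    for n in doc_lengths:
        lengths.extend(range(n))
    return lengths


def inverse_frequency_weights(ctx_lengths, tail_start, bin_size, boost=1.0):
    """Hypothetical per-token loss weights.

    Tokens with effective context below `tail_start` keep weight 1.0.
    Tail tokens get 1.0 plus an extra term inversely proportional to the
    number of tail tokens in their bin, scaled so that the extra term
    averages to `boost` over all tail tokens.
    """
    bins = [c // bin_size for c in ctx_lengths]
    tail_counts = Counter(b for b, c in zip(bins, ctx_lengths) if c >= tail_start)
    if not tail_counts:
        return [1.0] * len(ctx_lengths)

    n_tail = sum(tail_counts.values())
    n_bins = len(tail_counts)
    weights = []
    for b, c in zip(bins, ctx_lengths):
        if c < tail_start:
            weights.append(1.0)
        else:
            extra = boost * n_tail / (n_bins * tail_counts[b])
            weights.append(1.0 + extra)
    return weights


def weighted_mean_loss(token_losses, weights):
    """Weight-normalized average of per-token losses."""
    total = sum(w * l for w, l in zip(weights, token_losses))
    return total / sum(weights)


# Three documents packed into one 24-token sequence.
ctx = effective_context_lengths([4, 8, 12])
w = inverse_frequency_weights(ctx, tail_start=4, bin_size=4)
```

In this toy packing only 4 of the 24 targets have an effective context of 8 or more, even though the packed window holds 24 tokens. Those rarest targets receive the largest weight, which is the allocation idea the abstract describes; the exact normalization used here is an illustrative choice.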