SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from insufficient long-context alignment in realistic long-text scenarios, primarily due to low-quality training data, inefficient optimization, and suboptimal reward design. Method: We propose Short-to-Long preference Optimization (SoLo), a novel framework that decouples long-context alignment into two stages: short-context preference optimization and short-to-long reward consistency alignment. Central to SoLo is the first-ever Short-to-Long Reward Alignment (SoLo-RA) mechanism, theoretically guaranteeing effective transfer of short-context optimization gains to long contexts. The framework integrates short-to-long consistency regularization and multi-scale context sampling, and is compatible with mainstream preference optimization algorithms. Results: Extensive experiments on multiple long-context benchmarks demonstrate significant improvements in length and domain generalization, a 40% reduction in training data requirements, a 35% decrease in GPU memory consumption, and a 22% average improvement in inference response quality.

Technology Category

Application Category

📝 Abstract
Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $ extbf{S}$h$ extbf{o}$rt-to-$ extbf{Lo}$ng $ extbf{P}$reference $ extbf{O}$ptimization ($ extbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
Problem

Research questions and friction points this paper is trying to address.

Improving LLMs' long-context utilization via alignment optimization
Addressing data quality and training inefficiencies in long-context tasks
Enhancing short-to-long context transfer for better generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Short-to-Long Preference Optimization (SoLoPO)
Decouples long-context into short-context optimization
Enhances efficiency in data and training
🔎 Similar Papers
No similar papers found.
Huashan Sun
Huashan Sun
Beijing Institute of Technology
AINLP
S
Shengyi Liao
Tongyi Lab, Alibaba Group
Y
Yansen Han
Y
Yu Bai
Beijing Institute of Technology
Y
Yang Gao
Tongyi Lab, Alibaba Group
Cheng Fu
Cheng Fu
Institute of Software, Chinese Academy of Sciences
Entity ResolutionKnowledge GraphNatural Language Processing
Weizhou Shen
Weizhou Shen
Tongyi Lab, Alibaba Group
Fanqi Wan
Fanqi Wan
Sun Yat-sen University
NLPLLMs
M
Ming Yan
Tongyi Lab, Alibaba Group
J
Ji Zhang
Tongyi Lab, Alibaba Group
F
Fei Huang
Tongyi Lab, Alibaba Group