SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Large language models (LLMs) suffer from insufficient long-context alignment in realistic long-text scenarios, primarily due to low-quality training data, inefficient optimization, and suboptimal reward design. Method: We propose Short-to-Long preference Optimization (SoLo), a novel framework that decouples long-context alignment into two stages: short-context preference optimization and short-to-long reward consistency alignment. Central to SoLo is the first-ever Short-to-Long Reward Alignment (SoLo-RA) mechanism, theoretically guaranteeing effective transfer of short-context optimization gains to long contexts. The framework integrates short-to-long consistency regularization and multi-scale context sampling, and is compatible with mainstream preference optimization algorithms. Results: Extensive experiments on multiple long-context benchmarks demonstrate significant improvements in length and domain generalization, a 40% reduction in training data requirements, a 35% decrease in GPU memory consumption, and a 22% average improvement in inference response quality.

Technology Category

Application Category

📝 Abstract

Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $ extbf{S}$h$ extbf{o}$rt-to-$ extbf{Lo}$ng $ extbf{P}$reference $ extbf{O}$ptimization ($ extbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

Problem

Research questions and friction points this paper is trying to address.

Improving LLMs' long-context utilization via alignment optimization

Addressing data quality and training inefficiencies in long-context tasks

Enhancing short-to-long context transfer for better generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Short-to-Long Preference Optimization (SoLoPO)

Decouples long-context into short-context optimization

Enhances efficiency in data and training

🔎 Similar Papers

No similar papers found.