Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

📅 2025-07-19
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Large language models (LLMs) fine-tuned with supervised objectives often memorize rather than transfer, leaving them weak at cross-task generalization. To address this, the paper proposes Omni-Think, presented as the first framework to unify rule-verifiable rewards with generative preference signals from LLM-as-a-Judge evaluations inside a multi-task reinforcement learning paradigm that applies to both structured and open-ended tasks. It further introduces a task-aware curriculum that sequences training from structured to open-ended tasks, mitigating catastrophic forgetting and making RL optimization scale to subjective tasks. Experiments across four domains show that this curriculum-based training outperforms joint training by 5.2% and model merging by 9.1%, while improving cross-domain generalization and training stability.

📝 Abstract
The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
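To make the hybrid-reward idea concrete, here is a minimal Python sketch of how a trainer might route each rollout to the appropriate signal: a rule-verifiable check for structured tasks and an LLM-as-a-Judge preference score for open-ended ones. The function names, the binary exact-match rule, and the judge interface are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: route each rollout to a rule-based or
# judge-based reward depending on task type. All names are hypothetical.
from typing import Callable, Optional

def rule_based_reward(response: str, reference: str) -> float:
    # Verifiable tasks (e.g. math with a known answer): binary exact match.
    return 1.0 if response.strip() == reference.strip() else 0.0

def judge_based_reward(prompt: str, response: str,
                       judge: Callable[[str], float]) -> float:
    # Open-ended tasks: an LLM judge scores the response in [0, 1].
    query = (f"Rate this response to the prompt on a 0-1 scale.\n"
             f"Prompt: {prompt}\nResponse: {response}")
    return judge(query)

def hybrid_reward(task_type: str, prompt: str, response: str,
                  reference: Optional[str],
                  judge: Callable[[str], float]) -> float:
    # One reward interface for the multi-task RL loop: structured tasks
    # get the verifiable signal, everything else goes to the judge.
    if task_type == "structured" and reference is not None:
        return rule_based_reward(response, reference)
    return judge_based_reward(prompt, response, judge)
```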
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM generalization across diverse tasks
Combining rule-based and preference-based rewards in RL
Improving performance via curriculum-based task progression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task RL with hybrid rewards
Curriculum-based task progression (see the sketch after this list)
LLM-as-a-Judge evaluations
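As a rough illustration of the curriculum progression named above, the sketch below orders task groups from structured to open-ended and trains on them in stages. The specific task names, the openness ranking, and the train_stage hook are assumptions for illustration, not the paper's actual schedule.

```python
# Hypothetical curriculum driver: train on structured, verifiable tasks
# first, then progress to open-ended ones. Names and ordering are assumed.
from typing import Callable, Dict, List, Tuple

# Lower rank = more structured/verifiable; higher = more open-ended.
OPENNESS_RANK = {
    "math": 0,
    "coding": 1,
    "instruction_following": 2,
    "creative_writing": 3,
}

def curriculum_order(tasks: Dict[str, List]) -> List[Tuple[str, List]]:
    # Sort task groups by openness; unknown tasks default to last.
    return sorted(tasks.items(),
                  key=lambda kv: OPENNESS_RANK.get(kv[0], len(OPENNESS_RANK)))

def run_curriculum(tasks: Dict[str, List],
                   train_stage: Callable[[str, List], None]) -> None:
    # Sequential stages: the policy is grounded on verifiable rewards
    # before being optimized against subjective, judge-based rewards,
    # which is the ordering the paper credits with reducing forgetting.
    for name, samples in curriculum_order(tasks):
        train_stage(name, samples)
```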
👥 Authors
Derek Li
Huawei Noah’s Ark Lab, Montréal, Canada
Jiaming Zhou
Huawei Noah’s Ark Lab, Montréal, Canada
Amirreza Kazemi
Stability AI
Deep Learning, Generative Models, Reinforcement Learning
Qianyi Sun
Huawei Noah’s Ark Lab, Montréal, Canada
Abbas Ghaddar
University of Montreal
Artificial Intelligence, Natural Language Processing, Machine Learning, Deep Learning
Mohammad Ali Alomrani
University of Toronto
Machine Learning
Liheng Ma
PhD student, McGill University & Mila.
Geometric Deep Learning, Graph Neural Networks, Time Series, Machine Learning
Yu Luo
Huawei Noah’s Ark Lab, Beijing, China
Dong Li
Huawei Noah’s Ark Lab, Beijing, China
Feng Wen
Huawei Noah’s Ark Lab, Montréal, Canada
Jianye Hao
Huawei Noah's Ark Lab/Tianjin University
Multiagent Systems, Embodied AI
Mark Coates
Professor of Electrical Engineering, McGill University
Signal Processing, Computer Networks
Yingxue Zhang
Huawei Noah’s Ark Lab, Montréal, Canada