🤖 AI Summary
Supervised fine-tuning often leads large language models (LLMs) to memorize rather than transfer, resulting in weak generalization. This paper proposes Omni-Think, a framework that unifies rule-verifiable rewards with generative preference signals from LLM-as-a-Judge evaluations in a multi-task reinforcement learning paradigm spanning both structured and open-ended tasks. It introduces a task-aware curriculum strategy that sequences training from structured to open-ended tasks, mitigating catastrophic forgetting and scaling RL-based optimization to subjective tasks. Experiments across four domains show that the curriculum improves performance by 5.2% over joint training and 9.1% over model merging, while also improving cross-domain generalization and training stability.
📝 Abstract
The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
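The hybrid supervision described above can be sketched minimally: structured tasks receive a rule-based verifiable reward, while open-ended tasks receive a generative preference score, and a curriculum orders structured tasks before open-ended ones. All names below (`verify_rule`, `judge_score`, `hybrid_reward`) are illustrative assumptions, not the paper's actual API; in particular, `judge_score` is a stub where a real implementation would query a judge model.

```python
def verify_rule(response: str, reference: str) -> float:
    # Rule-based verifiable reward: 1.0 if the structured answer
    # matches the reference exactly (a stand-in for checks such as
    # exact-match grading or unit tests).
    return 1.0 if response.strip() == reference.strip() else 0.0

def judge_score(response: str, prompt: str) -> float:
    # Placeholder for an LLM-as-a-Judge preference score in [0, 1].
    # A real implementation would call a judge model here.
    return 0.5

def hybrid_reward(task_type: str, response: str, prompt: str,
                  reference: str = "") -> float:
    # Structured tasks (e.g. math, code) use the verifiable reward;
    # open-ended tasks fall back to the generative judge signal.
    if task_type == "structured":
        return verify_rule(response, reference)
    return judge_score(response, prompt)

# Task-aware curriculum: order structured tasks before open-ended ones.
tasks = [("creative_writing", "open_ended"), ("math", "structured")]
curriculum = sorted(tasks, key=lambda t: 0 if t[1] == "structured" else 1)
```

This sketch only illustrates the reward routing and task ordering; the paper's actual training loop, judge prompts, and task taxonomy are not reproduced here.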