🤖 AI Summary
Supervised fine-tuning often leads large language models (LLMs) to memorize rather than transfer, resulting in weak generalization. This paper proposes Omni-Think, a framework that unifies rule-verifiable rewards with generative preference signals from LLM-as-a-Judge evaluations in a multi-task reinforcement learning paradigm spanning both structured and open-ended tasks. It introduces a task-aware curriculum strategy that sequences training from structured to open-ended tasks, mitigating catastrophic forgetting and scaling RL-based optimization to subjective tasks. Experiments across four domains show that the curriculum improves performance by 5.2% over joint training and 9.1% over model merging, while also improving cross-domain generalization and training stability.
📝 Abstract
The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
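The hybrid supervision described above can be sketched minimally: structured tasks receive a rule-based verifiable reward, while open-ended tasks receive a generative preference score, and a curriculum orders structured tasks before open-ended ones. All names below (`verify_rule`, `judge_score`, `hybrid_reward`) are illustrative assumptions, not the paper's actual API; in particular, `judge_score` is a stub where a real implementation would query a judge model.

```python
def verify_rule(response: str, reference: str) -> float:
    # Rule-based verifiable reward: 1.0 if the structured answer
    # matches the reference exactly (a stand-in for checks such as
    # exact-match grading or unit tests).
    return 1.0 if response.strip() == reference.strip() else 0.0

def judge_score(response: str, prompt: str) -> float:
    # Placeholder for an LLM-as-a-Judge preference score in [0, 1].
    # A real implementation would call a judge model here.
    return 0.5

def hybrid_reward(task_type: str, response: str, prompt: str,
                  reference: str = "") -> float:
    # Structured tasks (e.g. math, code) use the verifiable reward;
    # open-ended tasks fall back to the generative judge signal.
    if task_type == "structured":
        return verify_rule(response, reference)
    return judge_score(response, prompt)

# Task-aware curriculum: order structured tasks before open-ended ones.
tasks = [("creative_writing", "open_ended"), ("math", "structured")]
curriculum = sorted(tasks, key=lambda t: 0 if t[1] == "structured" else 1)
```

This sketch only illustrates the reward routing and task ordering; the paper's actual training loop, judge prompts, and task taxonomy are not reproduced here.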