🤖 AI Summary
Social network services (SNS) exhibit dynamic multilinguality, high heterogeneity, and strong distributional shift, posing a fundamental trade-off between in-distribution performance and out-of-distribution robustness during supervised fine-tuning (SFT) of large language models (LLMs), especially for smaller-scale models. To address this, we propose a three-stage, reinforcement learning (RL)-dominant progressive post-training framework: (i) exploratory learning on curated SNS corpora; (ii) selective SFT combined with a general-domain data rehearsal mechanism to mitigate catastrophic forgetting; and (iii) RL-based refinement guided by SNS-centric signals. This paradigm pioneers an RL-first, stage-wise balanced optimization strategy. On a 4B-parameter model, it achieves +2.41 points over a 7B baseline and +8.74 points over the base model, surpassing prior methods while using less than 50% of their training data, demonstrating substantial gains in data efficiency, training stability, and cross-lingual robustness for small LLMs.
📝 Abstract
As a key medium for human interaction and information exchange, social networking services (SNS) pose unique challenges for large language models (LLMs): heterogeneous workloads, fast-shifting norms and slang, and multilingual, culturally diverse corpora that induce sharp distribution shift. Supervised fine-tuning (SFT) can specialize models but often triggers a "seesaw" between in-distribution gains and out-of-distribution robustness, especially for smaller models. To address these challenges, we introduce RedOne 2.0, an SNS-oriented LLM trained with a progressive, RL-prioritized post-training paradigm designed for rapid and stable adaptation. The pipeline consists of three stages: (1) Exploratory Learning on curated SNS corpora to establish initial alignment and identify systematic weaknesses; (2) Targeted Fine-Tuning that selectively applies SFT to the diagnosed gaps while mixing in a small fraction of general data to mitigate forgetting; and (3) Refinement Learning that re-applies RL with SNS-centric signals to consolidate improvements and harmonize trade-offs across tasks. Across various tasks spanning three categories, our 4B-scale model delivers an average improvement of about 2.41 points over the sub-optimal 7B baseline. Additionally, RedOne 2.0 achieves an average performance lift of about 8.74 points over the base model with less than half the data required by the SFT-centric method RedOne, evidencing superior data efficiency and stability at compact scales. Overall, RedOne 2.0 establishes a competitive, cost-effective baseline for domain-specific LLMs in SNS scenarios, advancing capability without sacrificing robustness.
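The rehearsal mechanism in the second stage can be made concrete with a minimal sketch. The helper below mixes diagnosed SNS examples with a small fraction of general-domain data before SFT; the function name, the rehearsal fraction, and the data layout are illustrative assumptions, not the paper's released implementation.

```python
import random

def mix_with_rehearsal(sns_data, general_data, general_frac=0.1, seed=0):
    """Blend a small general-domain rehearsal set into the targeted SNS
    SFT data so the final mixture is roughly `general_frac` general data,
    mitigating catastrophic forgetting of general capabilities.

    All names and the 10% default fraction are hypothetical placeholders.
    """
    rng = random.Random(seed)
    # Number of general examples so that general / (sns + general) ~= general_frac.
    n_general = int(len(sns_data) * general_frac / (1 - general_frac))
    rehearsal = rng.sample(general_data, min(n_general, len(general_data)))
    mixed = list(sns_data) + rehearsal
    rng.shuffle(mixed)
    return mixed
```

For example, with 90 diagnosed SNS examples and `general_frac=0.1`, the helper samples 10 general-domain examples, yielding a shuffled 100-example mixture that is 10% rehearsal data.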