AI Summary
This work addresses three key challenges in reasoning data construction for large language models: cold-start initialization, limited domain coverage, and reliance on costly human annotation. To overcome these, the authors propose an efficient method for synthesizing high-quality reasoning data by leveraging state-of-the-art large language models to automatically generate 9,000 long chain-of-thought (CoT) trajectories spanning over 1,000 fine-grained topics across eight scientific domains. A fully automated evaluation pipeline based on cross-verification among strong models ensures data quality without human intervention. Remarkably, fine-tuning a compact 4B-parameter Qwen3 model exclusively on this synthesized dataset achieves performance on par with or approaching that of much larger models, such as DeepSeek-R1 and Qwen3-235B, on challenging benchmarks like GPQA-Diamond and AIME, thereby significantly advancing the scalability and generalizability of reasoning data construction.
Abstract
Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, leaving broader scientific disciplines underrepresented; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it offers broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.