Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Chain-of-Thought (CoT) datasets suffer from limited scale, narrow sourcing, and absence of fine-grained annotations, hindering the development of large reasoning models (LRMs). To address this, we propose OmniThought—the first large-scale CoT dataset comprising 2 million high-quality samples, collaboratively generated and verified by dual teacher LLMs. We introduce a novel automatic annotation framework quantifying two orthogonal dimensions: Reasoning Verbosity (RV) and Cognitive Difficulty (CD), calibrated via human annotation and regression modeling. Furthermore, we design a fully automated, self-reliant distillation–verification pipeline using the Qwen2.5 series for multi-scale knowledge distillation and consistency filtering. Models trained on OmniThought achieve state-of-the-art performance on mathematical and coding benchmarks, with significant improvements in precise CoT length control and adaptive difficulty alignment.
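The summary describes a distillation–verification pipeline in which CoT processes are generated by teacher LRMs and kept only when they pass consistency checks. A minimal sketch of such a verification step, assuming the simplest criterion (the CoT's final answer must match the reference answer); the function name and data layout are illustrative, not the paper's actual schema:

```python
# Hypothetical sketch of consistency filtering in a distillation pipeline:
# a teacher-generated CoT is retained only if the answer it reaches
# matches the reference answer. Exact-match verification is an assumption;
# the paper's actual checks may be richer.

def consistency_filter(candidates, reference_answer):
    """candidates: list of (cot_text, final_answer) pairs from teacher LRMs.

    Returns the CoT texts whose final answer agrees with the reference.
    """
    return [cot for cot, ans in candidates if ans == reference_answer]

# Example: two teacher outputs for the same problem, one of which
# reaches a wrong answer and is filtered out.
teacher_outputs = [
    ("Step 1: ... therefore the answer is 42", "42"),
    ("Reasoning: ... hence the answer is 41", "41"),
]
kept = consistency_filter(teacher_outputs, "42")
```

In a real pipeline this gate would sit between multi-teacher generation and annotation, so that only CoTs ending in a verified answer are scored and added to the dataset.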

📝 Abstract
The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive CoT datasets for large reasoning models
Missing annotations for reasoning verbosity and cognitive difficulty in CoTs
Need for better training data to enhance LRMs' reasoning abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale CoT dataset with verbosity and difficulty annotations
Self-reliant pipeline for dataset curation
Training high-performing LRMs with optimal CoT outputs
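The RV and CD annotations are what make "optimal CoT output length and difficulty level" trainable: samples can be selected so their verbosity and difficulty match a target student model's capacity. A minimal sketch of that selection step, assuming per-sample `rv` and `cd` scores on a numeric scale; the field names, ranges, and thresholds are illustrative assumptions:

```python
# Illustrative sketch of RV/CD-based curation: keep only CoT samples whose
# Reasoning Verbosity (RV) falls in a target band and whose Cognitive
# Difficulty (CD) does not exceed the student model's assumed capacity.
# Score scales and thresholds here are hypothetical.

def select_cot_samples(samples, rv_range=(3, 7), cd_max=6):
    """Filter annotated CoT samples for a given student capacity.

    samples: list of dicts with "rv" and "cd" score fields.
    rv_range: inclusive (low, high) band of acceptable verbosity.
    cd_max: maximum cognitive difficulty the student should see.
    """
    lo, hi = rv_range
    return [s for s in samples if lo <= s["rv"] <= hi and s["cd"] <= cd_max]

# Example corpus: a trivial, a moderate, and a very hard sample.
corpus = [
    {"problem": "2+2", "cot": "...", "rv": 1, "cd": 1},
    {"problem": "evaluate the integral", "cot": "...", "rv": 5, "cd": 4},
    {"problem": "olympiad geometry", "cot": "...", "rv": 9, "cd": 9},
]
selected = select_cot_samples(corpus)  # only the moderate sample survives
```

Tightening `rv_range` trades reasoning depth for shorter outputs, while raising `cd_max` for larger student models lets them learn from harder CoTs, which is how per-model "optimal" training mixes could be assembled.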