ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks a systematic understanding of how internal thinking patterns interact with model scale in large language models (LLMs). Method: The authors introduce ThinkPatterns-21k, a dataset of 21k instruction-response pairs in which each pair is augmented with five internal thinking patterns: one unstructured pattern (monologue) and four structured variants (decomposition, self-ask, self-debate, self-critic), while the instruction and response are held fixed. Models spanning 3B to 32B parameters are trained and evaluated under a common protocol. Contribution/Results: Smaller models (<30B parameters) benefit from most structured thinking patterns, whereas the 32B model degrades under structured thinking such as decomposition; in contrast, unstructured monologue thinking yields broad gains across model sizes. All datasets, model checkpoints, and training logs are publicly released.

📝 Abstract
Large language models (LLMs) have demonstrated enhanced performance through the "Thinking then Responding" paradigm, where models generate internal thoughts before final responses (aka System 2 thinking). However, existing research lacks a systematic understanding of how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets. For each pair, we augment it with five distinct internal thinking patterns: one unstructured type (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while larger models (32B) with structured thinking like decomposition can suffer performance degradation, and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we release all of our datasets, checkpoints, and training logs for the diverse thinking patterns to facilitate reproducibility and further research in this direction.
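The augmentation scheme described above can be sketched as a small data-construction helper. This is a hypothetical illustration, not the released ThinkPatterns-21k schema: the field names (`instruction`, `thinking_pattern`, `thinking`, `response`) and the function are assumptions for exposition.

```python
# Illustrative sketch of the ThinkPatterns-21k augmentation: one QA pair is
# expanded into five training examples, one per thinking pattern, while the
# instruction and response stay identical. Field names are assumptions.

THINKING_PATTERNS = [
    "monologue",      # the one unstructured pattern
    "decomposition",  # structured variants below
    "self-ask",
    "self-debate",
    "self-critic",
]

def augment_pair(instruction, response, thoughts):
    """Expand one instruction-response pair into five records.

    `thoughts` maps each pattern name to an internal-thinking trace;
    only the thinking trace differs across the five variants.
    """
    return [
        {
            "instruction": instruction,
            "thinking_pattern": pattern,
            "thinking": thoughts[pattern],
            "response": response,
        }
        for pattern in THINKING_PATTERNS
    ]

records = augment_pair(
    "What is 17 * 3?",
    "51",
    {p: f"<{p} trace>" for p in THINKING_PATTERNS},
)
print(len(records))  # five variants of the same pair
```

Under this construction, any performance difference between variants is attributable to the thinking pattern alone, since instruction and response are held fixed.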
Problem

Research questions and friction points this paper is trying to address.

Impact of thinking patterns on LLM performance
Systematic study across different model sizes
Effectiveness of structured vs unstructured thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ThinkPatterns-21k dataset with 21k QA pairs
Analyzes impact of five thinking patterns on LLMs
Releases datasets and checkpoints for reproducibility
Authors

Pengcheng Wen — Hong Kong University of Science and Technology
Jiaming Ji — Peking University
Chi-Min Chan — Hong Kong University of Science and Technology
Juntao Dai — Zhejiang University
Donghai Hong — Peking University
Yaodong Yang — Peking University
Sirui Han — The Hong Kong University of Science and Technology
Yike Guo — Hong Kong University of Science and Technology