ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks a systematic understanding of how internal thinking patterns interact with model scale in large language models (LLMs). Method: The authors introduce ThinkPatterns-21k, a dataset of 21k instruction-response pairs in which each pair is augmented with five internal thinking patterns: one unstructured pattern (monologue) and four structured variants (decomposition, self-ask, self-debate, self-critic), while the instruction and response are held fixed. Models spanning 3B to 32B parameters are trained and evaluated under a common protocol. Contribution/Results: Smaller models (<30B parameters) benefit from most structured thinking patterns, whereas the 32B model degrades under structured thinking such as decomposition; in contrast, unstructured monologue thinking yields broad gains across model sizes. All datasets, model checkpoints, and training logs are publicly released.

📝 Abstract
Large language models (LLMs) have demonstrated enhanced performance through the "Thinking then Responding" paradigm, where models generate internal thoughts before final responses (aka System 2 thinking). However, existing research lacks a systematic understanding of how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets. For each pair, we augment it with five distinct internal thinking patterns: one unstructured type (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while larger models (32B) with structured thinking like decomposition can suffer performance degradation, and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we release all of our datasets, checkpoints, and training logs for the diverse thinking patterns to facilitate reproducibility and further research in this direction.
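The augmentation scheme described above can be sketched as a small data-construction helper. This is a hypothetical illustration, not the released ThinkPatterns-21k schema: the field names (`instruction`, `thinking_pattern`, `thinking`, `response`) and the function are assumptions for exposition.

```python
# Illustrative sketch of the ThinkPatterns-21k augmentation: one QA pair is
# expanded into five training examples, one per thinking pattern, while the
# instruction and response stay identical. Field names are assumptions.

THINKING_PATTERNS = [
    "monologue",      # the one unstructured pattern
    "decomposition",  # structured variants below
    "self-ask",
    "self-debate",
    "self-critic",
]

def augment_pair(instruction, response, thoughts):
    """Expand one instruction-response pair into five records.

    `thoughts` maps each pattern name to an internal-thinking trace;
    only the thinking trace differs across the five variants.
    """
    return [
        {
            "instruction": instruction,
            "thinking_pattern": pattern,
            "thinking": thoughts[pattern],
            "response": response,
        }
        for pattern in THINKING_PATTERNS
    ]

records = augment_pair(
    "What is 17 * 3?",
    "51",
    {p: f"<{p} trace>" for p in THINKING_PATTERNS},
)
print(len(records))  # five variants of the same pair
```

Under this construction, any performance difference between variants is attributable to the thinking pattern alone, since instruction and response are held fixed.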
Problem

Research questions and friction points this paper is trying to address.

Impact of thinking patterns on LLM performance
Systematic study across different model sizes
Effectiveness of structured vs unstructured thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ThinkPatterns-21k dataset with 21k QA pairs
Analyzes impact of five thinking patterns on LLMs
Releases datasets and checkpoints for reproducibility
Authors

Pengcheng Wen — Hong Kong University of Science and Technology
Jiaming Ji — Peking University
Chi-Min Chan — Hong Kong University of Science and Technology
Juntao Dai — Zhejiang University
Donghai Hong — Peking University
Yaodong Yang — Peking University
Sirui Han — The Hong Kong University of Science and Technology
Yike Guo — Hong Kong University of Science and Technology