🤖 AI Summary
Current large reasoning models (LRMs) rely heavily on macro-level evaluation metrics (e.g., accuracy, step count) and lack systematic, fine-grained characterization of their intrinsic reasoning patterns.
Method: We propose LOT (Language of Thought), the first interpretable, automatically generated natural language reasoning taxonomy. LOT is constructed from reasoning traces across mathematical, scientific, and programming tasks, enabling fine-grained cognitive modeling of 12 open-source LRMs. It integrates generative feature extraction, empirical distribution modeling, and iterative classification to capture model-specific reasoning behaviors.
Contribution/Results: LOT achieves 80–100% model attribution accuracy and enables reasoning-style alignment for smaller models, boosting Qwen3's GPQA accuracy by 3.3–5.7%. This work establishes the first open, reasoning-difference-aware classification framework for LRMs, advancing model diagnosis, knowledge distillation, and controllable reasoning.
📄 Abstract
Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.
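The pipeline the abstract describes (an LLM proposes a distinguishing feature in words, the feature's empirical frequency in each model's traces is estimated, and traces are then attributed via those distributions) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_feature` is a hypothetical stand-in for the generative step, and the single keyword-based feature and naive-Bayes-style scoring are simplifying assumptions.

```python
def propose_feature(traces_a, traces_b):
    """Hypothetical stand-in for the generative step: in LOT, an LLM reads
    reasoning traces from two LRMs and articulates a distinguishing feature
    in natural language. Here we hard-code one such feature as a predicate,
    e.g. "uses self-correction markers like 'wait'"."""
    return lambda trace: "wait" in trace.lower()

def feature_rate(feature, traces):
    """Empirical frequency of the feature across one model's traces."""
    return sum(feature(t) for t in traces) / len(traces)

def classify(trace, taxonomy):
    """Attribute a trace to model A or B by a naive product of per-feature
    likelihoods under each model's empirical feature rates."""
    score_a = score_b = 1.0
    for feature, (p_a, p_b) in taxonomy:
        present = feature(trace)
        score_a *= p_a if present else 1.0 - p_a
        score_b *= p_b if present else 1.0 - p_b
    return "A" if score_a >= score_b else "B"

# Iterating feature proposal over a trace dataset would grow the taxonomy,
# yielding a human-readable list of (feature description, rates) entries.
```

In the actual method, this loop repeats: each iteration adds a newly proposed natural-language feature, so the taxonomy both improves attribution accuracy and remains interpretable as a description of how the two models think differently.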