Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether the interpretability of chain-of-thought (CoT) reasoning traces is necessary for enhancing large language model (LLM) performance. Method: We conduct supervised fine-tuning experiments on open-book question answering with LLaMA and Qwen models, systematically comparing four classes of reasoning traces: DeepSeek R1 traces, LLM-generated summaries of R1 traces, LLM-generated post-hoc explanations of R1 traces, and algorithmically generated, verifiably correct traces. Contribution/Results: We find a striking mismatch: the traces that yield the strongest fine-tuning performance (those from DeepSeek R1) are judged the least interpretable by human raters, so high task accuracy does not require high trace readability. Accordingly, we propose decoupling *intermediate reasoning representations* from *user-facing interpretability*, challenging the prevailing assumption that CoT traces must be human-readable. Through a human-subject study with 100 participants, we quantitatively characterize the performance–interpretability trade-off, providing practical guidance for developing efficient and trustworthy reasoning models.

📝 Abstract
Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide inference but also serve as supervision signals for distillation into smaller models. A common but often implicit assumption is that CoT traces should be semantically meaningful and interpretable to the end user. While recent research questions the need for these traces to be semantically meaningful, in this paper we ask: *"Must CoT reasoning traces be interpretable to enhance LLM task performance?"* We investigate this question in the Open Book Question-Answering domain by supervised fine-tuning of LLaMA and Qwen models on four types of reasoning traces: (1) DeepSeek R1 traces, (2) LLM-generated summaries of R1 traces, (3) LLM-generated post-hoc explanations of R1 traces, and (4) algorithmically generated, verifiably correct traces. To quantify the trade-off between interpretability and performance, we further conduct a human-subject study with 100 participants rating the interpretability of each trace type. Our results reveal a striking mismatch: while fine-tuning on R1 traces yields the strongest performance, participants judged these traces to be the least interpretable. These findings suggest that it is useful to decouple intermediate tokens from end-user interpretability.
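The four fine-tuning conditions described in the abstract can be sketched as a simple data-preparation step: each open-book QA record is paired with one of the four trace types as the supervision target. This is a minimal illustrative sketch; the field names and the `build_sft_example` helper are hypothetical, not taken from the paper.

```python
# Hypothetical organization of the four SFT conditions compared in the paper.
TRACE_TYPES = [
    "r1_trace",              # (1) DeepSeek R1 reasoning traces
    "r1_summary",            # (2) LLM-generated summaries of R1 traces
    "post_hoc_explanation",  # (3) LLM-generated post-hoc explanations of R1 traces
    "verified_trace",        # (4) algorithmically generated, verifiably correct traces
]

def build_sft_example(record: dict, trace_type: str) -> dict:
    """Pair an open-book QA record with one trace type as the SFT target."""
    if trace_type not in TRACE_TYPES:
        raise ValueError(f"unknown trace type: {trace_type}")
    prompt = f"Context: {record['context']}\nQuestion: {record['question']}"
    # Target = the intermediate reasoning trace followed by the final answer.
    completion = f"{record[trace_type]}\nAnswer: {record['answer']}"
    return {"prompt": prompt, "completion": completion}

# Toy record with placeholder traces for each condition.
record = {
    "context": "Photosynthesis converts light energy into chemical energy.",
    "question": "What does photosynthesis convert light energy into?",
    "answer": "chemical energy",
    "r1_trace": "<think>The context states the conversion target directly...</think>",
    "r1_summary": "The passage directly names the conversion target.",
    "post_hoc_explanation": "The answer follows from the first sentence of the context.",
    "verified_trace": "step1: locate 'converts'; step2: extract its object.",
}

example = build_sft_example(record, "r1_trace")
```

Holding the prompt fixed and varying only the completion's trace type is what lets the study attribute performance differences to the traces themselves rather than to the question format.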
Problem

Research questions and friction points this paper is trying to address.

Whether interpretable reasoning traces are necessary to boost LLM performance
Quantifying the trade-off between trace interpretability and task accuracy
Whether intermediate tokens can be decoupled from end-user interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithmically generated, verifiably correct reasoning traces
Controlled supervised fine-tuning comparison across four trace types
Decoupling of intermediate tokens from end-user interpretability