TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates whether reasoning mechanisms genuinely enhance the performance of large language models (LLMs) on text classification tasks, weighing any gains against their substantial computational and time overhead. To this end, we introduce TextReasoningBench, a benchmark that provides the first unified evaluation of seven reasoning strategies—namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT—across ten prominent LLMs and five classification datasets. We propose two efficiency metrics that jointly account for performance gains and token consumption. Our findings reveal that reasoning is not universally beneficial: simple approaches such as CoT yield only marginal improvements of 1%–3%, while more complex strategies often fail to improve—or even degrade—performance. Moreover, most methods incur token costs 10 to 100 times higher than baseline inference, resulting in markedly low efficiency.

📝 Abstract
Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT, across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on large models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
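The two cost-aware metrics are described only informally here (gain per reasoning token, and relative gain versus relative token-cost growth); the benchmark's exact formulas are not given in this summary. A minimal sketch of plausible definitions, assuming simple ratio forms, might look like:

```python
# Hedged sketch of the two cost-aware metrics described in the abstract.
# The exact formulas are NOT given in this summary; the definitions below
# are illustrative assumptions, not TextReasoningBench's actual metrics.

def gain_per_token(acc_method, acc_baseline, tokens_method, tokens_baseline):
    """Assumed form: absolute performance gain per additional reasoning token."""
    extra_tokens = tokens_method - tokens_baseline
    if extra_tokens <= 0:
        # No extra token cost: any gain is "free"; no gain means zero efficiency.
        return float("inf") if acc_method > acc_baseline else 0.0
    return (acc_method - acc_baseline) / extra_tokens

def relative_efficiency(acc_method, acc_baseline, tokens_method, tokens_baseline):
    """Assumed form: relative performance gain divided by relative token-cost growth."""
    gain_ratio = (acc_method - acc_baseline) / acc_baseline
    cost_ratio = (tokens_method - tokens_baseline) / tokens_baseline
    return gain_ratio / cost_ratio if cost_ratio > 0 else float("inf")

# Hypothetical example: CoT gains +2 accuracy points over the IO baseline
# (0.86 -> 0.88) but consumes 20x the tokens (100 -> 2000).
print(gain_per_token(0.88, 0.86, 2000, 100))       # 0.02 / 1900, ~1.05e-5
print(relative_efficiency(0.88, 0.86, 2000, 100))  # (0.02/0.86) / 19, ~0.00122
```

Under these assumed forms, both metrics are tiny for the hypothetical CoT run, which mirrors the paper's headline finding that 10×–100× token costs buy only 1%–3% gains.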
Problem

Research questions and friction points this paper is trying to address.

text classification
reasoning
large language models
efficiency
performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

TextReasoningBench
reasoning efficiency
cost-aware evaluation
text classification
large language models