🤖 AI Summary
This paper addresses the pervasive "over-reasoning" problem in large language models (LLMs) during complex reasoning tasks, characterized by redundant token generation, futile reasoning attempts, and excessive responses. To systematically evaluate this phenomenon, the authors introduce DNA Bench, a dedicated benchmark of 150 adversarial prompts; they formally define and quantify over-reasoning and propose an evaluation paradigm in which "proactive silence" (withholding output when a response is inappropriate) constitutes the correct behavior. The framework decomposes model capability into four dimensions: instruction adherence, hallucination suppression, redundancy filtering, and unanswerable-query identification. Experiments reveal that state-of-the-art reasoning models (e.g., DeepSeek-R1, Claude-3.7-Sonnet) generate up to 70× more tokens than necessary and exhibit a 32% lower refusal accuracy than GPT-4o on simple unanswerable queries, paradoxically underperforming non-reasoning models. These findings provide an empirical foundation for diagnosing and improving LLM reasoning efficiency.
📝 Abstract
Test-time scaling has significantly improved large language model performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability also leads to excessive token generation and unnecessary problem-solving attempts. We introduce Don't Answer Bench (DNA Bench), a new benchmark designed to evaluate LLMs' ability to robustly recognize tricky reasoning triggers and avoid unnecessary generation. DNA Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly not for many recent prominent LLMs. DNA Bench tests models' abilities across several capabilities, such as instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable-question recognition. We evaluate reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI O3-mini, and Claude-3.7-sonnet, and compare them against a powerful non-reasoning model, GPT-4o. Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle more efficiently and with higher accuracy. Our findings underscore the need for more effective training and inference strategies in RLMs.
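To make the two quantities the abstract highlights concrete, here is a minimal toy sketch of how one might score a response on a "don't answer" prompt: refusal correctness plus a token-redundancy ratio. All function names, fields, and the example numbers are illustrative assumptions, not the paper's actual evaluation code.

```python
def token_redundancy_ratio(response_tokens: int, reference_tokens: int) -> float:
    """How many times more tokens the model emitted than a minimal correct reply.

    A ratio of 70.0 would correspond to the "up to 70x" behavior reported
    for reasoning models on DNA Bench.
    """
    return response_tokens / max(reference_tokens, 1)


def score_response(refused: bool, should_refuse: bool,
                   response_tokens: int, reference_tokens: int) -> dict:
    """Combine refusal correctness with a verbosity measure (hypothetical metric)."""
    return {
        "refusal_correct": refused == should_refuse,
        "redundancy": token_redundancy_ratio(response_tokens, reference_tokens),
    }


# Example: a verbose model fails to refuse an unanswerable prompt,
# emitting 700 tokens where ~10 would suffice for a brief refusal.
result = score_response(refused=False, should_refuse=True,
                        response_tokens=700, reference_tokens=10)
# → {'refusal_correct': False, 'redundancy': 70.0}
```

Under this toy scoring, "proactive silence" on an unanswerable query maximizes both components: the refusal is correct and the redundancy ratio stays near 1.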