🤖 AI Summary
Existing evaluations of large language models’ reasoning capabilities exhibit high sensitivity to answer extraction methods, resulting in unstable and inconsistent assessment outcomes. To address this, we propose Answer Regeneration (AR), an evaluation enhancement framework that decouples the reasoning process from answer extraction. AR introduces an additional reasoning step—prompting the model to regenerate its final answer based on its prior reasoning trace—thereby enabling robust answer extraction independent of heuristic rules. The framework is task-agnostic and applicable to diverse reasoning-intensive settings, including mathematical reasoning and open-domain question answering. Experiments across multiple benchmarks demonstrate that AR significantly improves evaluation robustness and accuracy, mitigating performance fluctuations induced by varying extraction strategies. Overall, AR provides a reliable, general-purpose solution for more stable and trustworthy assessment of LLM reasoning capabilities.
📝 Abstract
Evaluating generative models such as large language models (LLMs) commonly involves question-answering tasks where the final answer is selected based on the probabilities assigned to the answer choices. For models that require reasoning, however, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. To mitigate this, we propose a simple framework: Answer Regeneration. The method performs an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. We have also applied the framework to general math problems and open-ended question-answering tasks. Our analysis and framework could offer more reliable results for model evaluation.
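The abstract's core idea can be sketched in a few lines. Below is a minimal, hypothetical illustration of Answer Regeneration: the model's prior input and reasoning output are concatenated and re-fed with the prompt "Answer:", and the final answer is read from the regenerated completion. The `generate` callable and the toy model are assumptions for illustration, not the paper's actual implementation.

```python
def answer_regeneration(generate, question, reasoning_trace):
    """Answer Regeneration (AR) sketch: re-prompt the model with its own
    prior input and reasoning, prefaced by 'Answer:', so the final answer
    can be read directly from the regenerated output rather than parsed
    out of the reasoning trace with heuristic extraction rules.

    `generate` is an assumed text-completion interface: prompt -> str.
    """
    prompt = f"{question}\n{reasoning_trace}\nAnswer:"
    return generate(prompt).strip()


# Toy stand-in for an LLM, used only to demonstrate the prompt shape:
# it completes any 'Answer:'-terminated prompt with ' 42'.
def toy_generate(prompt):
    assert prompt.endswith("Answer:")
    return " 42"


final = answer_regeneration(
    toy_generate,
    "What is 6 * 7? Think step by step.",
    "6 * 7 = 42.",
)
print(final)  # → 42
```

Because the answer is taken from a fresh completion that the prompt format constrains to begin with the answer itself, no task-specific regular expressions or string-matching heuristics are needed at extraction time.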