🤖 AI Summary
Existing evaluations of large language models’ reasoning capabilities exhibit high sensitivity to answer extraction methods, resulting in unstable and inconsistent assessment outcomes. To address this, we propose Answer Regeneration (AR), an evaluation enhancement framework that decouples the reasoning process from answer extraction. AR introduces an additional reasoning step—prompting the model to regenerate its final answer based on its prior reasoning trace—thereby enabling robust answer extraction independent of heuristic rules. The framework is task-agnostic and applicable to diverse reasoning-intensive settings, including mathematical reasoning and open-domain question answering. Experiments across multiple benchmarks demonstrate that AR significantly improves evaluation robustness and accuracy, mitigating performance fluctuations induced by varying extraction strategies. Overall, AR provides a reliable, general-purpose solution for more stable and trustworthy assessment of LLM reasoning capabilities.
📝 Abstract
Evaluating generative models such as large language models (LLMs) commonly involves question-answering tasks where the final answer is selected based on the probabilities assigned to the answer choices. For models that require reasoning, however, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. To mitigate this, we propose a simple framework: Answer Regeneration. The method performs an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. We have also applied the framework to general math problems and open-ended question-answering tasks. Our analysis and framework could offer more reliable results for model evaluation.
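The abstract's core idea can be sketched in a few lines. Below is a minimal, hypothetical illustration of Answer Regeneration: the model's prior input and reasoning output are concatenated and re-fed with the prompt "Answer:", and the final answer is read from the regenerated completion. The `generate` callable and the toy model are assumptions for illustration, not the paper's actual implementation.

```python
def answer_regeneration(generate, question, reasoning_trace):
    """Answer Regeneration (AR) sketch: re-prompt the model with its own
    prior input and reasoning, prefaced by 'Answer:', so the final answer
    can be read directly from the regenerated output rather than parsed
    out of the reasoning trace with heuristic extraction rules.

    `generate` is an assumed text-completion interface: prompt -> str.
    """
    prompt = f"{question}\n{reasoning_trace}\nAnswer:"
    return generate(prompt).strip()


# Toy stand-in for an LLM, used only to demonstrate the prompt shape:
# it completes any 'Answer:'-terminated prompt with ' 42'.
def toy_generate(prompt):
    assert prompt.endswith("Answer:")
    return " 42"


final = answer_regeneration(
    toy_generate,
    "What is 6 * 7? Think step by step.",
    "6 * 7 = 42.",
)
print(final)  # → 42
```

Because the answer is taken from a fresh completion that the prompt format constrains to begin with the answer itself, no task-specific regular expressions or string-matching heuristics are needed at extraction time.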