Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multiple-choice question answering (MCQA) suffers from statistical leakage through answer options, inducing models to rely on spurious cues rather than genuine understanding, which distorts evaluation and biases reinforcement fine-tuning. To address this, we propose ReVeL, the first framework that reformulates MCQA into verifiable open-ended questions: it uses LLM-driven, classification-based rewriting and a differential verification mechanism to eliminate option leakage, and integrates answer-type-aware GRPO reinforcement learning for unified training and evaluation on multimodal models (e.g., Qwen2.5-VL). Experiments reveal up to 20 percentage points of inflation in standard MCQA scores. ReVeL preserves original MCQA accuracy while improving open-ended QA performance by about six percentage points, significantly enhancing reward-signal robustness, data efficiency, and evaluation reliability.

📝 Abstract
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a different rewriting and verification scheme to each. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
Problem

Research questions and friction points this paper is trying to address.

Replacing multiple-choice questions with open-form questions to prevent answer guessing
Improving evaluation reliability by detecting score inflation in multiple-choice benchmarks
Developing a hybrid framework for verifiable reasoning training and robust model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rewrites multiple-choice questions into open-form questions
Categorizes questions by answer type for tailored processing
Uses GRPO to fine-tune multimodal language models
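The pipeline above can be sketched as a minimal, hypothetical verifier: classify the gold answer's type, then apply a type-specific check whose binary outcome serves as the GRPO reward. This is an illustrative simplification; the paper's actual classification and verification are LLM-driven, and the function names here are assumptions.

```python
import re

def classify_answer_type(gold: str) -> str:
    """Crude rule-based stand-in for the paper's LLM-driven answer-type classification."""
    if re.fullmatch(r"-?\d+(\.\d+)?", gold.strip()):
        return "numeric"
    return "text"

def verify(prediction: str, gold: str) -> float:
    """Answer-type-aware binary reward: numeric tolerance match or normalized string match."""
    if classify_answer_type(gold) == "numeric":
        try:
            return float(abs(float(prediction) - float(gold)) < 1e-6)
        except ValueError:
            return 0.0  # non-numeric prediction against a numeric gold answer
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return float(norm(prediction) == norm(gold))
```

In an RFT loop, `verify(model_answer, gold_answer)` would supply the scalar reward for each rollout; the type-aware branching is what keeps open-form answers automatically checkable without answer options.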