A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-answer benchmarks for open-domain question answering overlook multi-answer ambiguity, leading to distorted training signals; moreover, manual annotation of multiple answers does not scale to multi-hop settings. To address this, we propose A²Search—the first end-to-end, annotation-free reinforcement learning framework for multi-answer QA. A²Search leverages large language models to sample reasoning trajectories and verify supporting evidence, automatically discovering alternative answers without human supervision. It introduces AnsF1—a reward function specifically designed for multi-answer evaluation—to guide policy optimization. The method tightly integrates multi-hop reasoning, verifiable search, and robust reward modeling. Evaluated on eight benchmarks, A²Search achieves state-of-the-art performance: A²Search-7B attains an average AnsF1@1 of 48.4% across four multi-hop tasks, significantly outperforming the larger ReSearch-32B. This demonstrates both the effectiveness and scalability of lightweight models for ambiguous, multi-answer QA.

📝 Abstract
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search.
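The AnsF1 reward described in the abstract can be understood as a set-level F1 between the model's predicted answers and the reference answer set. The sketch below is an assumption about its general shape (exact string match after light normalization); the paper's actual definition may additionally score partial or token-level matches, and the function name `ans_f1` is illustrative, not from the released code.

```python
def ans_f1(predicted, gold):
    """Set-level F1 between predicted and gold answer sets.

    A minimal sketch of a multi-answer reward in the spirit of AnsF1,
    assuming exact match after whitespace/case normalization. The
    paper's reward may differ in its matching and normalization rules.
    """
    norm = lambda s: " ".join(s.lower().split())
    pred = {norm(a) for a in predicted}
    ref = {norm(a) for a in gold}
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact-match against a single gold answer, this reward gives partial credit when the model recovers only some of the valid answers, and penalizes over-generation through the precision term.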
Problem

Research questions and friction points this paper is trying to address.

Handling ambiguous questions with multiple valid answers
Developing annotation-free training for multi-hop question answering
Optimizing models using reinforcement learning for diverse answer sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline detects ambiguous questions and gathers alternative answers
Model optimized with reinforcement learning and AnsF1 reward
Annotation-free end-to-end training framework handles ambiguity