SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluations predominantly focus on single-answer multiple-choice tasks, whereas real-world applications frequently require selecting *all* correct answers from a set (Select All That Apply, SATA). This capability remains systematically unassessed. To address this gap, we introduce SATA-BENCH, the first cross-domain benchmark for SATA evaluation, covering reading comprehension, law, and biomedicine, and show that even the strongest state-of-the-art model achieves only 41.8% exact-match accuracy. We conduct the first systematic diagnosis of two fundamental biases undermining SATA performance: *selection bias* (models favor certain options regardless of content) and *count bias* (models over- or under-predict the number of correct answers). To mitigate them, we propose Choice Funnel, a novel decoding strategy integrating token-level debiasing, adaptive thresholding, and joint multi-answer decoding. Our method improves exact-match accuracy by up to 29% while reducing inference cost by over 64%. We publicly release SATA-BENCH, along with code and models, to advance research on reliable multi-answer decision-making.
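The headline 41.8% figure uses exact-match accuracy, the strictest multi-answer metric: a prediction counts only if the predicted answer set equals the gold set exactly, with no credit for partial overlap. A minimal sketch (the function name and toy data are illustrative, not from the paper):

```python
# Exact match for Select-All-That-Apply: 1 only if the predicted option
# set equals the gold option set exactly; partial overlap scores 0.
def exact_match(predicted: set[str], gold: set[str]) -> int:
    return int(predicted == gold)

# Toy evaluation over three questions with lettered options.
preds = [{"A", "C"}, {"A"}, {"B", "D"}]
golds = [{"A", "C"}, {"A", "B"}, {"B", "D"}]
accuracy = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"Exact match: {accuracy:.1%}")  # → Exact match: 66.7%
```

Note how the second prediction scores zero despite containing a correct option; this strictness is what makes SATA questions much harder than single-answer multiple choice.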

📝 Abstract
Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias (models favor certain choices regardless of content) and count bias (models fail to predict the correct number of answers). To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on identifying all correct answers in multiple-choice questions
Addressing selection and count biases in LLMs for multi-answer tasks
Proposing a decoding strategy to improve accuracy and reduce inference costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SATA-BENCH for multi-answer evaluation
Proposes Choice Funnel decoding strategy
Combines token debiasing with adaptive thresholding
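The page describes Choice Funnel only at a high level (token debiasing plus adaptive thresholding), so the following is a loose, hypothetical sketch of the idea: discount each option's logit by a content-free prior to counter selection bias, then select every option whose calibrated confidence clears a threshold. All names, numbers, and the fixed threshold are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def choice_funnel_sketch(option_logits, prior_logits, threshold=0.5):
    """Debias each option's logit with a content-free prior, then keep every
    option whose calibrated confidence clears the threshold. The actual method
    decodes answers with an adaptive threshold; a fixed one stands in here
    purely for illustration."""
    return sorted(
        opt
        for opt, logit in option_logits.items()
        if sigmoid(logit - prior_logits[opt]) >= threshold
    )

# Toy example: the model favors option "A" a priori (selection bias),
# so its raw logit is discounted before thresholding.
logits = {"A": 3.0, "B": 2.5, "C": -1.0, "D": -1.5}
priors = {"A": 0.5, "B": 0.0, "C": 0.0, "D": 0.0}
print(choice_funnel_sketch(logits, priors))  # → ['A', 'B']
```

Because selection happens by thresholding rather than by picking a single argmax, the decoder can return zero, one, or several options, which is what a SATA question requires.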