Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
First-token probability (FTP)-based symbolic evaluation for multiple-choice question answering (MCQA) suffers from reduced reliability due to token misalignment—where high-probability but irrelevant tokens dominate—and semantic misinterpretation—where valid answer tokens are suppressed by generic prefixes. Method: We propose a zero-parameter, zero-fine-tuning natural-language prefilling prefix (e.g., “The correct option is:”) that explicitly steers the model to generate standardized answer-starting tokens. This marks the first adaptation of prefilling—a technique originally developed in AI safety—to MCQA evaluation. Contribution/Results: Evaluated on MMLU, ARC, and HellaSwag, our method significantly improves FTP accuracy, calibration, and output stability. It matches the performance of costly generative approaches requiring full autoregressive decoding plus external classifiers, while reducing inference overhead by an order of magnitude—effectively bridging the gap between efficient symbolic evaluation and resource-intensive generative evaluation.

📝 Abstract
Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, FTP with the prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
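The mechanism described in the abstract can be sketched in a few lines. The snippet below is illustrative only: the logit values are made up to mimic the failure mode the paper describes (a generic preamble token like "The" dominating the next-token distribution), and a real implementation would read these logits from an LLM's next-token distribution after the prompt, with the prefix "The correct option is:" prefilled into the assistant output. Only the selection logic reflects the FTP method itself.

```python
import math

def softmax(logits):
    """Convert raw next-token logits to a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def ftp_select(first_token_logits, option_tokens=("A", "B", "C", "D")):
    """First-token probability (FTP) scoring: pick the option whose
    initial token has the highest next-token probability."""
    probs = softmax(first_token_logits)
    return max(option_tokens, key=lambda t: probs.get(t, 0.0))

# Hypothetical logits WITHOUT the prefilling prefix: a generic preamble
# token ("The") soaks up most of the probability mass, so the comparison
# among option tokens rests on a thin, unreliable tail (misalignment).
logits_plain = {"The": 5.0, "A": 1.2, "B": 2.1, "C": 0.4, "D": 0.3}

# Hypothetical logits AFTER prefilling "The correct option is:" into the
# model output: probability mass shifts onto the option letters, so the
# first generated token is a clean, comparable answer choice.
logits_prefilled = {"The": 0.1, "A": 1.0, "B": 4.2, "C": 0.6, "D": 0.2}

print(ftp_select(logits_plain))      # selects among A-D in both cases,
print(ftp_select(logits_prefilled))  # but only the second is well-calibrated
```

Note that the prefix changes the distribution the model produces, not the scoring rule: `ftp_select` is identical in both cases, which is why the method needs no parameter changes, fine-tuning, or external classifier.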
Problem

Research questions and friction points this paper is trying to address.

FTP-based MCQA evaluation is fragile due to token misalignment and semantic misinterpretation
Whether a prefilling attack can improve the accuracy and consistency of LLM evaluation
How to enhance FTP reliability without model modification or full autoregressive decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefilling attack: a natural-language prefix that steers model output toward clean option tokens
Structured prefix improves FTP accuracy, calibration, and output consistency
No parameter modification or fine-tuning required, keeping inference cost low