🤖 AI Summary
Large language models (LLMs) exhibit poor robustness in multiple-choice question answering (MCQA), being highly sensitive to input perturbations. To address this, we propose Token Constraint Decoding (TCD), a post-hoc, model-agnostic decoding method that enforces token-level prediction consistency without fine-tuning. TCD introduces, for the first time, a token-level prediction alignment mechanism into the decoding process, combining dynamic logit penalization with prompt engineering. Experiments show that TCD effectively mitigates overconfident predictions, though different models require distinct penalty schedules. On benchmarks including CommonsenseQA, MMLU, and MMLU-Pro, TCD boosts absolute accuracy by up to 39% for weaker models (e.g., Gemma 3 1B) under noisy inputs, significantly improving inference stability in realistic scenarios with imperfect inputs.
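The summary does not spell out the exact algorithm, but the core idea of constraining decoding via logit penalization can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `token_constraint_decode`, the dict-based logits, and the fixed `penalty` value are all assumptions made for clarity.

```python
def token_constraint_decode(logits, valid_token_ids, penalty=10.0):
    """Softly constrain a prediction to a set of valid answer tokens.

    `logits` maps token id -> raw logit (a stand-in for a model's
    vocabulary-sized output vector). `valid_token_ids` holds the ids of
    the MCQA option tokens (e.g. "A"-"D"). Every out-of-set logit is
    reduced by `penalty`, suppressing off-option predictions.
    """
    constrained = {
        tok: (logit if tok in valid_token_ids else logit - penalty)
        for tok, logit in logits.items()
    }
    # Greedy pick over the penalized logits.
    return max(constrained, key=constrained.get)

# Toy example: token 7 is an off-option distractor with the highest raw
# logit; the penalty suppresses it so option token 42 is selected.
logits = {42: 1.5, 7: 2.0, 13: 0.1}
print(token_constraint_decode(logits, valid_token_ids={42, 13}))  # → 42
```

In a real decoding loop the same penalization would be applied to the model's logit vector at the answer position before sampling or argmax.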
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD), a simple yet effective inference-time algorithm that enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma 3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
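The penalty sweep mentioned above amounts to selecting, per model, the penalty strength that maximizes accuracy on held-out data. A hedged sketch of that selection loop, with an invented helper `constrained_argmax`, toy logits, and an arbitrary penalty grid standing in for the paper's actual schedule:

```python
def constrained_argmax(logits, valid_ids, penalty):
    # Penalize out-of-set tokens, then pick the highest-scoring token.
    return max(logits, key=lambda t: logits[t] - (0.0 if t in valid_ids else penalty))

def sweep_penalty(dev_set, valid_ids, penalties=(0.0, 1.0, 5.0, 10.0)):
    """Return the penalty value with the best dev-set accuracy."""
    def accuracy(p):
        return sum(
            constrained_argmax(logits, valid_ids, p) == gold
            for logits, gold in dev_set
        ) / len(dev_set)
    return max(penalties, key=accuracy)

# Toy dev set of (logits, gold token id) pairs. With penalty 0 the
# distractor token 7 wins the first item; any positive penalty on the
# grid recovers the gold answer.
dev = [({42: 1.5, 7: 2.0}, 42), ({13: 0.8, 7: 0.2}, 13)]
print(sweep_penalty(dev, valid_ids={42, 13}))  # → 1.0
```

Because the best penalty differs across models, this sweep would be run once per model rather than fixed globally.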