Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit poor robustness in multiple-choice question answering (MCQA), being highly sensitive to input perturbations. To address this, we propose Token Constraint Decoding (TCD), a post-hoc, model-agnostic decoding method that enforces token-level prediction consistency without fine-tuning. TCD introduces, for the first time, a token-level prediction alignment mechanism into the decoding process, integrating dynamic logit penalization with prompt engineering. Experiments demonstrate that TCD effectively mitigates overconfident predictions and requires model-specific penalty scheduling. On benchmarks including CommonsenseQA, MMLU, and MMLU-Pro, TCD boosts absolute accuracy by up to 39% for weaker models (e.g., Gemma 3 1B) under noisy inputs, significantly enhancing inference stability in realistic scenarios involving imperfect inputs.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
Problem

Research questions and friction points this paper is trying to address.

Enhances robustness of LLMs against input noise in QA tasks
Improves alignment of token-level predictions for better accuracy
Addresses overconfidence in model outputs through regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Constraint Decoding enhances robustness
Algorithm aligns token-level predictions effectively
Combines with prompt engineering for gains
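The abstract describes TCD as an inference-time method that penalizes logits to keep predictions aligned with the valid answer tokens. The paper does not give the exact algorithm here, so the following is a minimal sketch of one plausible realization: out-of-set tokens get a fixed logit penalty before the softmax. The function names, the toy vocabulary, and the penalty value are illustrative assumptions, not the authors' implementation.

```python
import math

def tcd_penalize(logits, valid_token_ids, penalty=5.0):
    """Subtract a fixed penalty from every token outside the valid answer set.

    logits: dict mapping token_id -> raw logit
    valid_token_ids: ids of legal answer tokens (e.g. the options "A".."D")
    penalty: hypothetical constant; the paper sweeps this per model
    """
    return {
        tok: (logit if tok in valid_token_ids else logit - penalty)
        for tok, logit in logits.items()
    }

def softmax(logits):
    # Numerically stable softmax over a dict of logits.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Toy example: token 0 is an out-of-format distractor (e.g. "The"),
# tokens 1-4 stand in for the option letters A-D.
logits = {0: 3.0, 1: 2.5, 2: 1.0, 3: 0.5, 4: 0.2}
constrained = tcd_penalize(logits, valid_token_ids={1, 2, 3, 4}, penalty=5.0)
probs = softmax(constrained)
best = max(probs, key=probs.get)  # the distractor is suppressed; an option wins
```

Under this sketch, the penalty acts as the soft regularizer the paper's penalty-sweep analysis tunes: a small penalty merely dampens out-of-set tokens, while a large one approaches a hard constraint on the answer vocabulary.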
Jui-Ming Yao
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Hao-Yuan Chen
University of London, Mindify AI
Zi-Xian Tang
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Bing-Jia Tan
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Sheng-Wei Peng
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Bing-Cheng Xie
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Shun-Feng Su
Professor of EE, National Taiwan University of Science and Technology