🤖 AI Summary
This work proposes a training-free contrastive decoding method to address the inefficiency and errors in large language model (LLM) inference caused by low-confidence tokens. During decoding, the approach dynamically identifies low-confidence tokens and constructs a contrastive reference distribution by replacing high-confidence context tokens with minimal placeholders. A confidence-driven subtractive contrastive mechanism then intervenes precisely at these positions to suppress unreliable predictions. The authors claim this is the first application of such a mechanism to LLM inference, reporting significant accuracy improvements and substantial reductions in output length across multiple mathematical reasoning benchmarks. Notably, the method introduces negligible KV-cache overhead, enabling lightweight and efficient inference optimization.
📝 Abstract
Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, realized as Confidence-Driven Contrastive Decoding (CCD), an approach that improves reasoning reliability through targeted token-level intervention. CCD detects low-confidence tokens during decoding and intervenes selectively at these positions: it constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at the low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
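The abstract does not give the exact update rule, but the core idea of confidence-gated subtractive decoding can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the threshold `tau`, the scaling factor `alpha`, and the function `ccd_adjust` are all hypothetical names, and the reference logits stand in for the distribution produced from the placeholder-substituted context.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ccd_adjust(main_logits, ref_logits, tau=0.5, alpha=1.0):
    """Hypothetical sketch of confidence-driven subtractive decoding.

    Only intervene when the main distribution is low-confidence
    (max probability below tau); otherwise return logits unchanged.
    At low-confidence positions, subtract the reference log-probs
    (scaled by alpha), down-weighting tokens the placeholder
    reference also prefers.
    """
    probs = softmax(main_logits)
    if max(probs) >= tau:
        return main_logits  # high confidence: no intervention
    ref_probs = softmax(ref_logits)
    # Subtracting log p_ref boosts tokens the reference assigns low mass to.
    return [l - alpha * math.log(max(r, 1e-12))
            for l, r in zip(main_logits, ref_probs)]

# Toy usage: a low-confidence step where the reference strongly favors token 1.
main = [1.0, 0.9, 0.1]   # nearly flat -> low confidence, triggers intervention
ref = [0.0, 3.0, 0.0]    # reference distribution concentrates on token 1
adjusted = ccd_adjust(main, ref)
```

In the toy example, the near-tie between tokens 0 and 1 is resolved in favor of token 0, since token 1 is heavily preferred by the reference and therefore suppressed by the subtraction; a confidently predicted step (e.g. logits `[5.0, 0.0, 0.0]`) passes through untouched.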