Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

📅 2026-05-05
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing test-time adaptation methods for autoregressive models lack a unified theoretical foundation in entropy minimization and instead rely on heuristic strategies, which limits their performance. This work proposes a rigorous entropy minimization framework tailored to autoregressive models, formulating adaptation as the joint optimization of a token-level policy gradient loss and a token-level entropy loss, thereby connecting pseudo-labeling with reinforcement learning principles. The framework subsumes prior approaches as special cases and, applied to the Whisper speech recognition model, shows consistent performance gains across more than twenty domain-shift scenarios, including noise, accent variation, and multilingual conditions.
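The token-level view in the summary rests on a standard identity: the entropy of an autoregressive sequence distribution equals the expected sum of its per-token conditional entropies. The sketch below checks this numerically on a toy model. The two-token vocabulary, the conditional probabilities, and all function names are hypothetical illustrations, not taken from the paper or from Whisper.

```python
import math
from itertools import product

VOCAB = ("a", "b")  # hypothetical two-token vocabulary


def next_token_probs(prefix):
    """Toy conditional p(y_t | y_<t); depends only on the last token (assumption)."""
    if not prefix:
        return {"a": 0.7, "b": 0.3}
    if prefix[-1] == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.4, "b": 0.6}


def token_entropy(prefix):
    """Entropy of the next-token distribution given a prefix."""
    return -sum(p * math.log(p) for p in next_token_probs(prefix).values() if p > 0)


def sequence_prob(seq):
    """Probability of a full sequence under the autoregressive factorization."""
    prob = 1.0
    for t, tok in enumerate(seq):
        prob *= next_token_probs(seq[:t])[tok]
    return prob


def sequence_entropy(length):
    """Exact sequence-level entropy, by enumerating every sequence."""
    total = 0.0
    for seq in product(VOCAB, repeat=length):
        p = sequence_prob(seq)
        total -= p * math.log(p)
    return total


def expected_token_entropy_sum(length):
    """Expected sum of per-token entropies along sequences drawn from the model."""
    total = 0.0
    for seq in product(VOCAB, repeat=length):
        p = sequence_prob(seq)
        total += p * sum(token_entropy(seq[:t]) for t in range(length))
    return total


if __name__ == "__main__":
    print(sequence_entropy(3), expected_token_entropy_sum(3))
```

The two printed values agree up to floating-point error. The expectation is taken over sequences sampled from the model itself, so lowering sequence entropy also changes which prefixes get sampled. That dependence is what a policy-gradient term accounts for.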
📝 Abstract
Test-Time Adaptation (TTA) via entropy minimization (EM) has proven effective for classification tasks, yet its application to generative autoregressive models remains theoretically fragmented. Existing approaches typically rely on distinct heuristics, such as teacher forcing with pseudo labels or policy-gradient-based reinforcement learning, without a unified mathematical foundation. In this work, we resolve this discrepancy by deriving a rigorous formulation of EM tailored to autoregressive models. We show that the exact objective naturally decomposes into a token-level policy gradient loss and a token-level entropy loss, and we reinterpret prior methods as partial realizations of this unified formulation. Using Whisper ASR as a testbed, we demonstrate that our approach consistently improves performance across more than 20 diverse domains, including acoustic noise, accents, and multilingual settings.
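One way to see why such a decomposition arises is the standard score-function derivation below. It is a sketch under assumed notation ($x$ the input, $y = (y_1, \dots, y_T)$ the output sequence, $\theta$ the model parameters, $H_t$ the per-token entropy) and is not necessarily the exact form used in the paper.

```latex
% Autoregressive factorization and per-token entropy (assumed notation)
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x),
\qquad
H_t(y_{<t}) = -\sum_{v} p_\theta(v \mid y_{<t}, x)\, \log p_\theta(v \mid y_{<t}, x)

% Sequence-level entropy via the chain rule
H(\theta) = -\mathbb{E}_{y \sim p_\theta}\!\left[ \log p_\theta(y \mid x) \right]
          = \mathbb{E}_{y \sim p_\theta}\!\left[ \sum_{t=1}^{T} H_t(y_{<t}) \right]

% Gradient: a token-level entropy term plus a policy-gradient term
\nabla_\theta H(\theta)
  = \underbrace{\mathbb{E}_{y \sim p_\theta}\!\left[ \sum_{t=1}^{T} \nabla_\theta H_t(y_{<t}) \right]}_{\text{token-level entropy term}}
  + \underbrace{\mathbb{E}_{y \sim p_\theta}\!\left[ \sum_{t=1}^{T} \nabla_\theta \log p_\theta(y_t \mid y_{<t}, x) \sum_{t' > t} H_{t'}(y_{<t'}) \right]}_{\text{policy-gradient term}}
```

The second term appears because the expectation is taken over samples from $p_\theta$ itself. Terms with $t \ge t'$ vanish because $\mathbb{E}_{y_t}\!\left[ \nabla_\theta \log p_\theta(y_t \mid y_{<t}, x) \right] = 0$. When minimizing $H$, the policy-gradient term treats the negative sum of future token entropies as the reward for token $y_t$.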
Problem

Research questions and friction points this paper is trying to address.

Test-Time Adaptation
Entropy Minimization
Autoregressive Models
Theoretical Foundation
Generative Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Adaptation
Entropy Minimization
Autoregressive Models
Policy Gradient
Whisper ASR