MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the error accumulation and catastrophic forgetting that texture-biased backbones risk in continual test-time adaptation. Inspired by the human visual system's ability to decouple shape from texture, the authors propose MoASE, a plug-in mixture-of-experts architecture. Its Activation Sparsity Experts use Spatial Differentiable Dropout (SDD) to form complementary high- and low-activation pathways that disentangle domain-agnostic structure from domain-specific texture. An Activation Sparsity Gate produces input-adaptive SDD thresholds, a Domain-Aware Router assigns per-sample expert weights, and high- and low-rank bottlenecks diversify representations. MoASE++ adds Domain-Adaptive On-Policy Distillation: an exponential moving average (EMA)-anchored on-policy reverse KL distillation paired with an augmentation policy conditioned on entropy and confidence, intended to curb confirmation bias and stabilize supervision on unlabeled streams. The authors report consistent state-of-the-art results on classification (CIFAR-10/100-C, ImageNet-C) and on semantic segmentation (Cityscapes→ACDC).
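To make the distillation terms concrete, the following is a minimal sketch of a reverse KL loss and an EMA parameter update. It is not the authors' implementation: the function names, the flat-list parameter representation, and the `momentum` default are assumptions chosen for illustration, and the sketch says nothing about how the paper samples or weights the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits, eps=1e-12):
    """KL(student || teacher).

    The expectation is taken under the student's own distribution,
    which is what makes this the "reverse" direction relative to the
    usual distillation loss KL(teacher || student).
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        ps * (math.log(ps + eps) - math.log(pt + eps))
        for ps, pt in zip(p_s, p_t)
    )


def ema_update(teacher_params, student_params, momentum=0.999):
    """Move each teacher parameter slowly toward the student's value.

    `momentum` is a hypothetical default, not a value from the paper.
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


if __name__ == "__main__":
    student = [2.0, 0.0, 0.0]
    teacher = [0.0, 0.0, 0.0]
    print("KL(student || teacher):", reverse_kl(student, teacher))
    print("KL(teacher || student):", reverse_kl(teacher, student))
    print("EMA step:", ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

The two printed KL values differ, which illustrates why the direction matters: the reverse form penalizes the student for placing mass where the slowly moving teacher places little.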
📝 Abstract
Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across augmented views of the same input and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes→ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
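The abstract does not give the equations for Spatial Differentiable Dropout or the Domain-Aware Router, so the sketch below only illustrates the general idea of complementary high- and low-activation pathways and per-sample expert weighting. The sigmoid relaxation of the threshold, the `temperature` parameter, and the names `split_by_activation` and `route` are all assumptions made for illustration, not the paper's formulation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def split_by_activation(activations, threshold, temperature=0.1):
    """Split activations into complementary high and low pathways.

    A smooth mask around `threshold` stands in for a hard cut so the
    split stays differentiable (an assumed relaxation, not the
    paper's SDD definition). By construction high[i] + low[i] equals
    activations[i].
    """
    high, low = [], []
    for a in activations:
        mask = sigmoid((a - threshold) / temperature)
        high.append(mask * a)
        low.append((1.0 - mask) * a)
    return high, low


def route(expert_outputs, router_logits):
    """Combine per-expert outputs using softmax weights for one sample."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    acts = [0.05, 0.9, 0.4]
    high, low = split_by_activation(acts, threshold=0.5)
    print("high pathway:", high)
    print("low pathway:", low)
    # Hypothetical router logits favouring the high-activation expert.
    print("mixed:", route([high, low], router_logits=[1.0, 0.0]))
```

In this toy setup the threshold is passed in by hand; in the described method it would come from the Activation Sparsity Gate and vary with the input.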
Problem

Research questions and friction points this paper is trying to address.

continual test-time adaptation
catastrophic forgetting
texture bias
non-stationary streams
domain shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Sparsity Experts
Domain-Adaptive On-Policy Distillation
Continual Test-Time Adaptation
Spatial Differentiable Dropout
Mixture of Experts