BoTTA: Benchmarking on-device Test Time Adaptation

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing test-time adaptation (TTA) research lacks systematic investigation under resource constraints typical of mobile/edge devices. Method: This work introduces the first TTA benchmark tailored for edge-device scenarios, addressing practical challenges including few-shot learning, limited categories, and concurrent inter- and intra-sample distribution shifts. We propose a four-dimensional constrained evaluation framework, formalize and empirically assess periodic adaptation paradigms, and conduct system-level measurements (memory footprint, latency, accuracy) on real-world platforms (e.g., Raspberry Pi). We integrate mainstream algorithms—including SHOT—for quantitative analysis of performance–overhead trade-offs under strict resource limits. Contribution/Results: Empirical results reveal severe generalization degradation of current TTA methods under few-shot and unseen-category settings; SHOT incurs up to 1.08× higher peak memory versus baseline. This work fills a critical gap in edge-oriented TTA benchmarking and provides empirically grounded design principles and deployment guidelines for lightweight TTA.
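The system-level measurements mentioned above (peak memory, latency) can be approximated with standard-library tooling. The following is an illustrative sketch, not BoTTA's actual measurement harness; `measure_step` is a hypothetical helper, and a real on-device run would sample full-process RSS rather than Python-level allocations:

```python
import time
import tracemalloc

def measure_step(fn, *args):
    """Return (result, latency_seconds, peak_bytes) for one call to fn.

    Illustrative only: tracemalloc sees Python-level allocations; a full
    on-device measurement (e.g. on a Raspberry Pi) would instead sample
    process peak RSS, e.g. via resource.getrusage(resource.RUSAGE_SELF).
    """
    tracemalloc.start()
    t0 = time.perf_counter()
    out = fn(*args)
    latency = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return out, latency, peak

# Example: a stand-in "adaptation step" that allocates a 1 MB buffer.
_, latency, peak = measure_step(lambda: bytearray(1_000_000))
print(f"latency={latency:.6f}s peak={peak} bytes")
```

Wrapping a single adaptation step this way is enough to compare methods' performance-overhead trade-offs on the same device.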

📝 Abstract
The performance of deep learning models depends heavily on test samples at runtime, and shifts from the training data distribution can significantly reduce accuracy. Test-time adaptation (TTA) addresses this by adapting models during inference without requiring labeled test data or access to the original training set. While research has explored TTA from various perspectives like algorithmic complexity, data and class distribution shifts, model architectures, and offline versus continuous learning, constraints specific to mobile and edge devices remain underexplored. We propose BoTTA, a benchmark designed to evaluate TTA methods under practical constraints on mobile and edge devices. Our evaluation targets four key challenges caused by limited resources and usage conditions: (i) limited test samples, (ii) limited exposure to categories, (iii) diverse distribution shifts, and (iv) overlapping shifts within a sample. We assess state-of-the-art TTA methods under these scenarios using benchmark datasets and report system-level metrics on a real testbed. Furthermore, unlike prior work, we align with on-device requirements by advocating periodic adaptation instead of continuous inference-time adaptation. Experiments reveal key insights: many recent TTA algorithms struggle with small datasets, fail to generalize to unseen categories, and depend on the diversity and complexity of distribution shifts. BoTTA also reports device-specific resource use. For example, while SHOT improves accuracy by $2.25\times$ with $512$ adaptation samples, it uses $1.08\times$ peak memory on Raspberry Pi versus the base model. BoTTA offers actionable guidance for TTA in real-world, resource-constrained deployments.
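To make the periodic-adaptation idea concrete, here is a minimal, hypothetical sketch: a linear softmax head whose weights take one entropy-minimization step (a TENT/SHOT-style unsupervised objective) every `period` unlabeled test samples, instead of at every inference. The model, class names, and hyperparameters are all illustrative assumptions, not BoTTA's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(probs):
    """Mean Shannon entropy of a batch of softmax outputs."""
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

class PeriodicTTA:
    """Toy periodic test-time adapter for a linear softmax head.

    Unlabeled test samples are buffered, and the weights take one
    entropy-minimization step every `period` inferences, rather than after
    every sample as in continuous TTA. Purely illustrative.
    """

    def __init__(self, weights, period=4, lr=0.1):
        self.w = weights          # (n_features, n_classes)
        self.period = period
        self.lr = lr
        self.buffer = []

    def predict(self, x):
        """Classify one sample (1-D feature vector); adapt periodically."""
        pred = int(softmax(x @ self.w).argmax())
        self.buffer.append(x)
        if len(self.buffer) >= self.period:
            self._adapt(np.vstack(self.buffer))
            self.buffer.clear()
        return pred

    def _adapt(self, xs):
        # One gradient step on mean prediction entropy (no labels needed).
        p = softmax(xs @ self.w)
        logp = np.log(p + 1e-12)
        h = -(p * logp).sum(axis=-1, keepdims=True)   # per-sample entropy
        grad_logits = -p * (logp + h)                 # d(entropy)/d(logits)
        self.w -= self.lr * xs.T @ grad_logits / len(xs)
```

Batching adaptation this way trades a small delay in adapting to new shifts for far fewer backward passes, which is the kind of accuracy-versus-overhead trade-off the benchmark quantifies.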
Problem

Research questions and friction points this paper is trying to address.

Evaluating TTA methods for mobile and edge devices
Addressing limited resources and diverse distribution shifts
Assessing device-specific resource use and adaptation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for on-device test time adaptation
Evaluates TTA under mobile resource constraints
Advocates periodic adaptation over continuous
Michal Danilowski
University of Birmingham, Birmingham, United Kingdom
Soumyajit Chatterjee
Senior Research Scientist, Bell Labs and Visiting Scholar, University of Cambridge
Pervasive Computing · Applied Machine Learning
Abhirup Ghosh
University of Birmingham, Birmingham, United Kingdom