A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Model training on edge devices is hindered by low throughput, limited storage, and heterogeneous data importance, resulting in poor data utilization. To address this, the authors propose Titan, a two-stage dynamic data selection framework. Titan combines coarse-grained pre-screening with fine-grained optimal selection; theoretically derives and implements a fine-grained importance-aware greedy selection strategy; and designs an online importance estimation module coupled with a lightweight modeling component, enabling conflict-free pipelining of data selection and model training via asynchronous execution. Evaluated on real edge hardware, Titan reduces training time by up to 43%, improves final model accuracy by up to 6.2%, and incurs only minor overhead in latency, memory footprint, and energy consumption.

📝 Abstract
The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, due to low training throughput, limited storage and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework Titan to select the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, Titan filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch with the highest model performance improvement for the current training round. To further enhance time and resource efficiency, Titan leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate Titan on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that Titan achieves up to 43% reduction in training time and 6.2% increase in final accuracy with minor system overhead, such as data processing delay, memory footprint and energy consumption.
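The two-stage selection described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the use of a cheap proxy score for pre-screening, and the simple top-k greedy rule for fine-grained selection are all assumptions standing in for Titan's actual mechanisms.

```python
import heapq

def coarse_prescreen(stream, candidate_size, proxy_score):
    """Stage 1 (hypothetical): keep the candidate_size streaming samples
    with the highest cheap proxy scores, using a bounded min-heap so
    memory stays constant regardless of stream length."""
    heap = []  # min-heap of (score, arrival_index, sample)
    for i, sample in enumerate(stream):
        s = proxy_score(sample)
        if len(heap) < candidate_size:
            heapq.heappush(heap, (s, i, sample))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, i, sample))  # evict current minimum
    return [sample for _, _, sample in heap]

def greedy_select(candidates, batch_size, importance):
    """Stage 2 (hypothetical): greedily pick the batch_size candidates with
    the highest estimated importance, a stand-in for the paper's
    theoretically optimal fine-grained selection strategy."""
    return sorted(candidates, key=importance, reverse=True)[:batch_size]
```

Under this sketch, the coarse stage bounds memory on a storage-limited device while the fine stage only ever scores the small candidate set, which is the efficiency split the abstract describes.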
Problem

Research questions and friction points this paper is trying to address.

Improves on-device ML training data utilization
Selects high-importance data from streaming sources
Reduces training time while boosting model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage data selection for edge training
Pipeline co-execution of selection and training
Optimal data batch identification strategy
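The pipeline co-execution idea above can be sketched with a background thread that selects the next batch while the current training step runs, communicating through a bounded queue. The function `pipelined_training` and its parameters are hypothetical names for illustration, not Titan's API; the paper's actual pipeline additionally schedules selection onto idle computing resources to avoid conflicts.

```python
import queue
import threading

def pipelined_training(raw_batches, train_step, select, depth=2):
    """Overlap data selection with training (hypothetical sketch):
    a background thread scores and selects upcoming batches while the
    main thread runs the current training step."""
    q = queue.Queue(maxsize=depth)  # bounded so selection can't run far ahead

    def selector():
        for raw in raw_batches:
            q.put(select(raw))      # selection runs concurrently with training
        q.put(None)                 # sentinel: stream exhausted

    threading.Thread(target=selector, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(train_step(batch))  # training overlaps next selection
    return results
```

The bounded queue depth caps memory use, and because selection and the training step run on different threads, selection latency is hidden whenever it is shorter than a training step.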