Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning

📅 2026-01-03

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a task- and model-aware framework for quantifying the similarity between wireless datasets, with applications to model transfer, data selection, and synthetic data generation. For an unsupervised task, CSI compression, the approach combines UMAP embeddings with Wasserstein and Euclidean distances and achieves Pearson correlations exceeding 0.85 between dataset distance and cross-dataset model performance. For a supervised task, downlink beam prediction, it adds label awareness through supervised UMAP and a penalty for dataset imbalance. On both tasks the resulting distances outperform conventional baselines and correlate more strongly with model transferability, which supports their use in simulation-to-reality comparison and task-oriented data generation.
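The embed-then-measure idea in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: a PCA projection stands in for the UMAP embedding so that the example needs only NumPy, a sliced Wasserstein-1 estimate stands in for the Wasserstein distance, and the Euclidean distance is taken between embedding centroids, which is an assumption. The function names and the synthetic Gaussian "datasets" are hypothetical.

```python
import numpy as np

def pca_embed(x, ref, dim=2):
    """Project rows of x onto the top `dim` principal axes of ref.
    Assumption: PCA is used here only as a lightweight stand-in for UMAP."""
    mean = ref.mean(axis=0)
    # Right singular vectors of the centered reference data are its principal axes.
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    return (x - mean) @ vt[:dim].T

def sliced_wasserstein(a, b, n_proj=64, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance
    between two point clouds with the same number of samples."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In 1-D, W1 between equal-size samples is the mean gap between sorted values.
    pa = np.sort(a @ dirs.T, axis=0)
    pb = np.sort(b @ dirs.T, axis=0)
    return float(np.mean(np.abs(pa - pb)))

def centroid_distance(a, b):
    """Euclidean distance between the mean embeddings of two datasets."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# Synthetic placeholder datasets (not wireless data): B is close to A, C is far.
rng = np.random.default_rng(0)
n, d = 500, 16
data_a = rng.normal(0.0, 1.0, size=(n, d))
data_b = rng.normal(0.2, 1.0, size=(n, d))
data_c = rng.normal(3.0, 1.0, size=(n, d))

# Fit one shared embedding on the pooled data so distances are comparable.
pooled = np.vstack([data_a, data_b, data_c])
emb_a, emb_b, emb_c = (pca_embed(x, pooled) for x in (data_a, data_b, data_c))

sw_near = sliced_wasserstein(emb_a, emb_b)
sw_far = sliced_wasserstein(emb_a, emb_c)
eu_near = centroid_distance(emb_a, emb_b)
eu_far = centroid_distance(emb_a, emb_c)
print(f"sliced W1: near={sw_near:.3f}, far={sw_far:.3f}")
print(f"centroid:  near={eu_near:.3f}, far={eu_far:.3f}")
```

Fitting a single embedding on the pooled samples is a design choice made for this sketch: it keeps all datasets in one coordinate system, so the distances between pairs can be compared directly.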

📝 Abstract

This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets. The framework supports applications such as dataset selection and augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and decisions about training or adapting models for new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We first apply the framework to an unsupervised task, channel state information (CSI) compression with autoencoders. Using metrics based on UMAP embeddings combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We then apply the framework to a supervised task, downlink beam prediction with convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.
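The evaluation criterion described in the abstract, correlating dataset distances with train-on-one/test-on-another performance, can be sketched as follows. This is an illustrative assumption about how such a score might be computed, not the authors' code: `pearson` and `score_metric` are hypothetical names, performance is represented as a loss (higher is worse, so a good metric yields a positive correlation), and both matrices hold placeholder values rather than results from the paper.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def score_metric(dist, loss):
    """Correlate pairwise dataset distances with cross-dataset losses.
    dist[i, j]: distance between dataset i and dataset j.
    loss[i, j]: loss of a model trained on dataset i and tested on dataset j.
    The diagonal (train and test on the same dataset) is excluded."""
    off_diag = ~np.eye(dist.shape[0], dtype=bool)
    return pearson(dist[off_diag], loss[off_diag])

# Placeholder example: five hypothetical datasets placed on a line.
positions = np.array([0.0, 1.0, 2.5, 4.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])

# Placeholder losses that grow linearly with distance (an idealized case).
loss = 0.1 + 0.5 * dist

r = score_metric(dist, loss)
print(f"Pearson r between distance and cross-dataset loss: {r:.3f}")
```

If performance is reported as an accuracy-like score instead of a loss, the expected sign flips: a useful distance would then correlate negatively with performance.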

Problem

Research questions and friction points this paper is trying to address.

wireless dataset similarity

dataset distance

model transferability

sim2real comparison

task-aware evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dataset similarity

task-aware distance

UMAP embedding