🤖 AI Summary
This work proposes a task- and model-aware framework for quantifying the similarity between wireless datasets, supporting model transfer, data selection, and synthetic data generation. For the unsupervised CSI compression task, the approach combines UMAP embeddings with Wasserstein and Euclidean distances and achieves Pearson correlation coefficients exceeding 0.85 between dataset distance and cross-dataset model performance. For the supervised downlink beam prediction task, it adds label awareness through supervised UMAP and a penalty for dataset imbalance. Across both tasks, the proposed distances outperform conventional baselines and correlate more strongly with model transferability, which points to practical use in simulation-to-reality comparison and task-oriented data generation.
📝 Abstract
This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets. It supports applications such as dataset selection and augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and decisions about model training or adaptation for new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We first apply the framework to an unsupervised task, channel state information (CSI) compression with autoencoders. Using metrics based on UMAP embeddings combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We also apply the framework to a supervised task, downlink beam prediction with convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.
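The evaluation idea in the abstract (score a candidate distance by how well it correlates with train-on-one/test-on-another performance) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the centroid-based Euclidean distance, and the toy numbers are all assumptions made for illustration. The 2-D points stand in for low-dimensional embeddings (for example UMAP outputs), and the loss values are invented placeholders, not reported results. A loss-like score is assumed, so a good metric should give a positive correlation.

```python
import math
from itertools import permutations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two embedded point sets.

    Hypothetical stand-in for a dataset distance; the paper's metrics also
    involve Wasserstein distances, which are not implemented here.
    """
    dim = len(a[0])
    centroid_a = [sum(p[k] for p in a) / len(a) for k in range(dim)]
    centroid_b = [sum(p[k] for p in b) / len(b) for k in range(dim)]
    return math.dist(centroid_a, centroid_b)


def transferability_correlation(embeddings, cross_loss, distance=centroid_distance):
    """Correlate dataset distances with cross-dataset losses.

    embeddings: dict mapping dataset name -> list of embedded points.
    cross_loss: dict mapping (train_name, test_name) -> loss of a model
                trained on train_name and tested on test_name.
    """
    pairs = list(permutations(sorted(embeddings), 2))
    distances = [distance(embeddings[s], embeddings[t]) for s, t in pairs]
    losses = [cross_loss[(s, t)] for s, t in pairs]
    return pearson(distances, losses)


# Toy inputs (assumed, for illustration only).
toy_embeddings = {
    "A": [(0.0, 0.0), (0.0, 2.0)],
    "B": [(3.0, 1.0), (3.0, 1.0)],
    "C": [(0.0, 5.0), (0.0, 5.0)],
}
# Placeholder losses chosen to grow linearly with distance.
toy_loss = {
    (s, t): 0.1 * centroid_distance(toy_embeddings[s], toy_embeddings[t])
    for s, t in permutations(sorted(toy_embeddings), 2)
}

r = transferability_correlation(toy_embeddings, toy_loss)
print(f"Pearson correlation between distance and loss: {r:.3f}")
```

With an accuracy-like score instead of a loss, the sign convention flips, so a useful distance would show a strongly negative correlation.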