C$^{2}$TC: A Training-Free Framework for Efficient Tabular Data Condensation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of existing tabular data condensation methods, as well as their neglect of heterogeneous features and class imbalance. The authors propose the first training-free framework for tabular data condensation, formulating the condensation objective as a class-adaptive clustering assignment problem that jointly optimizes class allocation and feature representation. To efficiently solve this NP-hard problem, they introduce a Hybrid Categorical Feature Encoding (HCFE) scheme coupled with a heuristic local search algorithm (HFILS) that leverages soft assignment and intra-class clustering strategies. Extensive experiments on ten real-world datasets demonstrate that the method achieves at least a two-orders-of-magnitude speedup over state-of-the-art approaches while delivering superior performance on downstream tasks.

📝 Abstract
Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. However, existing DC methods are computationally intensive due to reliance on complex gradient-based optimization. Moreover, they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C$^{2}$TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. Moreover, a hybrid categorical feature encoding (HCFE) is proposed for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C$^{2}$TC improves efficiency by at least 2 orders of magnitude over state-of-the-art baselines, while achieving superior downstream performance.
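The condensation recipe the abstract describes — frequency-aware per-class sample budgets combined with class-wise clustering, with no model training in the loop — can be sketched roughly as follows. This is an illustrative stand-in, not the paper's CCAP/HFILS solver: the function name, the plain Lloyd's k-means sub-routine, and the frequency-proportional allocation rule are assumptions for the sketch.

```python
import numpy as np

def condense_by_classwise_clustering(X, y, budget, rng=None):
    """Training-free condensation sketch: split the total sample budget
    across classes in proportion to class frequency (with at least one
    slot per class, to mitigate imbalance), then run k-means inside each
    class and keep the centroids as the condensed synthetic rows."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    # adaptive per-class budget: frequency-proportional, minimum one per class
    alloc = np.maximum(1, np.round(budget * counts / counts.sum()).astype(int))
    X_syn, y_syn = [], []
    for c, k in zip(classes, alloc):
        Xc = X[y == c]
        k = min(k, len(Xc))
        # simple Lloyd's k-means as a stand-in for the paper's local search
        centers = Xc[rng.choice(len(Xc), size=k, replace=False)].astype(float)
        for _ in range(25):
            dists = ((Xc[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = dists.argmin(1)
            for j in range(k):
                members = Xc[labels == j]
                if len(members):
                    centers[j] = members.mean(0)
        X_syn.append(centers)
        y_syn.append(np.full(k, c))
    return np.vstack(X_syn), np.concatenate(y_syn)
```

Because nothing here backpropagates through a model, the cost is dominated by the clustering passes — which is the source of the efficiency gap the abstract claims over gradient-based condensation methods.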
Problem

Research questions and friction points this paper is trying to address.

tabular data
dataset condensation
class imbalance
heterogeneous features
data efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
tabular data condensation
class-adaptive clustering
heterogeneous features
class imbalance
👥 Authors
Sijia Xu
University of New South Wales, Australia
Fan Li
University of New South Wales
graph mining, data-centric artificial intelligence
Xiaoyang Wang
University of New South Wales (UNSW)
database, data mining, graph processing
Zhengyi Yang
University of New South Wales, Australia
Xuemin Lin
Shanghai Jiao Tong University, China