Categorical distance correlation under general encodings and its application to high-dimensional feature screening

📅 2026-01-19
🤖 AI Summary
This study addresses two problems: measuring nonlinear dependence between categorical variables and a response variable, and screening features when the categorical variables are high-dimensional. It extends distance correlation to categorical data under general encoding schemes, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, it uses the spacing information between categories to improve dependence measurement. Theoretically, the method is shown to have the sure screening property for high-dimensional categorical data under mild conditions. Empirically, a simulation study compares the performance of different encodings, and an analysis of the 2018 General Social Survey data illustrates their practical utility.

📝 Abstract
In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach leverages the spacing information between categories, which enhances the performance of distance correlation. Two estimates including the maximum likelihood estimate and a bias-corrected estimate are given, together with their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings, and illustrate their practical utility using the 2018 General Social Survey data.
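
To make the idea of screening categorical features by distance correlation concrete, here is a minimal sketch. It uses the standard sample (V-statistic) distance correlation applied to one-hot encoded features, not the paper's maximum likelihood or bias-corrected estimates, and it does not implement the semicircle encoding, which the abstract does not define. The function names `one_hot`, `distance_correlation`, and `screen` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def one_hot(labels, n_categories):
    """Encode integer labels 0..n_categories-1 as rows of an identity matrix."""
    return np.eye(n_categories)[np.asarray(labels)]

def _double_centered_distances(z):
    """Pairwise Euclidean distance matrix with row, column, and grand means removed."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Standard sample distance correlation (assumption: not the paper's estimators)."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0  # one of the variables is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))

def screen(features, y, n_categories, top_k):
    """Rank categorical columns by distance correlation with y and keep the top_k."""
    scores = np.array([
        distance_correlation(one_hot(col, n_categories), y)
        for col in np.asarray(features).T
    ])
    keep = np.argsort(scores)[::-1][:top_k]
    return keep, scores
```

Under this sketch, changing the encoding only means replacing `one_hot` with another map from categories to vectors; the pairwise distances between encoded categories are where spacing information would enter.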
Problem

Research questions and friction points this paper is trying to address.

categorical data
distance correlation
feature screening
high-dimensional data
encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

distance correlation
categorical data
general encodings
sure screening
high-dimensional feature screening