Mixture of Attention Yields Accurate Results for Tabular Data

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the pronounced feature heterogeneity of tabular data and the limited modeling capacity of existing Transformers, this paper proposes MAYA, an encoder-decoder Transformer framework. Methodologically, it introduces: (1) a Mixture of Attention (MOA) mechanism that fuses heterogeneous features by constructing multiple parallel attention branches and averaging their features, with minimal parameter overhead; (2) collaborative learning with a dynamic consistency weight constraint to produce more robust representations; and (3) a decoder that uses label-aware cross-attention to integrate tabular features with label features, so that the resulting dual-attention design captures both intra-instance and inter-instance interactions. Extensive experiments on diverse tabular classification and regression benchmarks show that MAYA outperforms state-of-the-art Transformer-based methods in both predictive accuracy and generalization.
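
The summary gives enough to sketch the MOA block: several self-attention branches run in parallel over the per-feature tokens and their outputs are averaged. Below is a minimal PyTorch rendering under that reading; the branch count, head count, dimensions, and the residual-plus-normalization wiring are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class MOABlock(nn.Module):
        """Mixture-of-Attention sketch: parallel attention branches, averaged."""

        def __init__(self, d_model: int = 64, n_heads: int = 4, n_branches: int = 3):
            super().__init__()
            # Each branch is an independent multi-head self-attention module.
            self.branches = nn.ModuleList(
                nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                for _ in range(n_branches)
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_features, d_model) -- one token per tabular feature.
            outs = [attn(x, x, x, need_weights=False)[0] for attn in self.branches]
            # Averaging fuses the branches' heterogeneous views without
            # widening the output, which keeps parameter growth limited.
            fused = torch.stack(outs).mean(dim=0)
            return self.norm(x + fused)  # residual + LayerNorm (assumed)

    x = torch.randn(32, 10, 64)  # 32 rows, 10 feature tokens, 64-dim embeddings
    print(MOABlock()(x).shape)   # torch.Size([32, 10, 64])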

📝 Abstract
Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Mixture of Attention (MOA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.
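
The decoder described in the abstract, which fuses sample representations with label features via cross-attention, could plausibly look like the sketch below: queries come from the encoded feature tokens, keys and values from learnable label embeddings. The embedding table, mean pooling, and linear head are assumptions for illustration, not the paper's specification.

    import torch
    import torch.nn as nn

    class LabelCrossAttentionDecoder(nn.Module):
        """Sketch of label-aware cross-attention (assumed design)."""

        def __init__(self, d_model: int = 64, n_heads: int = 4, n_classes: int = 2):
            super().__init__()
            # One learnable embedding per class, serving as keys/values.
            self.label_embed = nn.Parameter(torch.randn(n_classes, d_model))
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, n_features, d_model) -- encoder output.
            labels = self.label_embed.unsqueeze(0).expand(h.size(0), -1, -1)
            # Sample tokens query the label embeddings, injecting class-level
            # (inter-instance) semantics into each instance's representation.
            fused, _ = self.cross_attn(h, labels, labels)
            return self.head(fused.mean(dim=1))  # pool tokens, predict logits

    h = torch.randn(32, 10, 64)
    print(LabelCrossAttentionDecoder()(h).shape)  # torch.Size([32, 2])
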
Problem

Research questions and friction points this paper is trying to address.

Existing transformer-based methods lack mechanisms for the strong feature heterogeneity of tabular data
Learning robust representations from heterogeneous features without excessive parameter growth
Lifting transformer-based performance on both tabular classification and regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Attention for feature fusion
Dynamic consistency weight constraint (see the sketch after this list)
Cross-attention for label integration
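
The dynamic consistency weight constraint listed above is not detailed on this page; a common collaborative-learning recipe it is consistent with is sketched below, where two branches each receive a task loss plus a consistency term whose weight ramps up during training. The symmetric-KL consistency and the linear ramp schedule are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def collaborative_loss(logits_a, logits_b, targets, step,
                           ramp_steps=1000, max_w=1.0):
        # Supervised task loss for both branches.
        task = F.cross_entropy(logits_a, targets) + F.cross_entropy(logits_b, targets)
        # Symmetric KL divergence between the branches' predictive distributions.
        log_pa = F.log_softmax(logits_a, dim=-1)
        log_pb = F.log_softmax(logits_b, dim=-1)
        consistency = 0.5 * (
            F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")
            + F.kl_div(log_pb, log_pa.exp(), reduction="batchmean")
        )
        # Dynamic weight: the consistency constraint is phased in so that
        # early training is dominated by the task loss (assumed linear ramp).
        w = max_w * min(1.0, step / ramp_steps)
        return task + w * consistency
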
Xuechen Li
Stanford University
machine learning · artificial intelligence · statistics
Yupeng Li
Shanghai Warpdrive Technology Co.Ltd, Floor 2, No.57 Boxia Road, Pudong, Shanghai
Jian Liu
Shanghai Warpdrive Technology Co.Ltd, Floor 2, No.57 Boxia Road, Pudong, Shanghai
Xiaolin Jin
Shanghai Warpdrive Technology Co.Ltd, Floor 2, No.57 Boxia Road, Pudong, Shanghai
Tian Yang
Shanghai Warpdrive Technology Co.Ltd, Floor 2, No.57 Boxia Road, Pudong, Shanghai
Xin Hu
Shanghai Warpdrive Technology Co.Ltd, Floor 2, No.57 Boxia Road, Pudong, Shanghai