Generalization Bounds for Transformer Channel Decoders

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of theoretical guarantees for the generalization performance of Transformer-based channel decoders such as ECCT. Working from statistical learning theory, it links the multiplicative noise estimation error to the bit error rate and, via a bit-wise Rademacher complexity analysis, establishes the first upper bound on the generalization error of ECCT. Incorporating covering-number estimates, the analysis shows that parity-check-matrix-guided masked attention induces sparsity, reducing model complexity and yielding a tighter generalization bound. The framework applies uniformly to single- and multi-layer ECCT architectures and explicitly characterizes how the generalization error depends on code length, model size, and training-set size, confirming in theory that sparse attention structures improve generalization.
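
As a concrete illustration of the masked-attention idea, the sketch below builds a binary self-attention mask from a parity-check matrix H using the Tanner-graph connectivity that the summary describes. This is a minimal reconstruction, not the paper's code; the function name ecct_attention_mask and the mask convention (1 = attend, 0 = masked) are assumptions.

import numpy as np

def ecct_attention_mask(H: np.ndarray) -> np.ndarray:
    """Binary self-attention mask derived from a parity-check matrix H
    of shape (n - k, n).  The decoder input has length 2n - k:
    n magnitude tokens followed by n - k syndrome tokens.
    (Illustrative sketch; conventions assumed, not taken from the paper.)"""
    m, n = H.shape                        # m = n - k parity checks
    mask = np.eye(n + m, dtype=np.int64)  # every token attends to itself
    # Two magnitude tokens are connected if their bits share a parity check.
    mask[:n, :n] |= ((H.T @ H) > 0).astype(np.int64)
    # A magnitude token and a syndrome token are connected if the bit
    # participates in that check (the edges of the Tanner graph).
    mask[:n, n:] |= H.T.astype(np.int64)
    mask[n:, :n] |= H.astype(np.int64)
    return mask

# Example: Hamming(7,4).  The resulting mask is sparse, which is the
# structural property the covering-number argument exploits.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
mask = ecct_attention_mask(H)
print(mask.shape, 1 - mask.mean())  # (10, 10) and the masked fraction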

📝 Abstract
Transformer channel decoders, such as the Error Correction Code Transformer (ECCT), have shown strong empirical performance in channel decoding, yet their generalization behavior remains theoretically unclear. This paper studies the generalization performance of ECCT from a learning-theoretic perspective. By establishing a connection between multiplicative noise estimation errors and bit-error-rate (BER), we derive an upper bound on the generalization gap via bit-wise Rademacher complexity. The resulting bound characterizes the dependence on code length, model parameters, and training set size, and applies to both single-layer and multi-layer ECCTs. We further show that parity-check-based masked attention induces sparsity that reduces the covering number, leading to a tighter generalization bound. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.
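
For context, the generalization-gap bound described above follows the standard Rademacher-complexity template from statistical learning theory, shown here with generic constants (the paper's bit-wise variant and exact constants differ). With probability at least 1 - \delta over an i.i.d. sample S = {z_1, ..., z_m},

\mathbb{E}[\ell_f] \;\le\; \hat{\mathbb{E}}_S[\ell_f] \;+\; 2\,\hat{\mathfrak{R}}_S(\ell \circ \mathcal{F}) \;+\; 3\sqrt{\frac{\log(2/\delta)}{2m}},
\qquad
\hat{\mathfrak{R}}_S(\mathcal{G}) \;=\; \mathbb{E}_{\sigma}\!\left[\sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, g(z_i)\right],

where the \sigma_i are i.i.d. Rademacher signs. A smaller covering number for the masked-attention hypothesis class upper-bounds \hat{\mathfrak{R}}_S (e.g., via Dudley's entropy integral), which is the mechanism by which sparsity tightens the bound.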
Problem

Research questions and friction points this paper is trying to address.

generalization
Transformer
channel decoding
bit-error-rate
Rademacher complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

generalization bound
Transformer decoder
Rademacher complexity
masked attention
channel coding
Qinshan Zhang
Tsinghua Shenzhen International Graduate School, Shenzhen, China, and Pengcheng Laboratory, Shenzhen, China
Bin Chen
Harbin Institute of Technology (Shenzhen), University Town of Shenzhen, Nanshan District, Shenzhen, 518055, China
Yong Jiang
Tsinghua University
Large Language Model · Natural Language Processing · Machine Learning
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory · machine learning · computer vision · AI security