Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information

📅 2025-10-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing theoretical analyses of Transformers cover only tree-structured dependencies (single-parent relationships), whereas real-world sequences often arise from directed acyclic graphs (DAGs) with multi-parent structure. Method: We propose the first theoretical framework under which Transformers provably recover DAGs. We introduce kernel-guided mutual information (KG-MI), a novel information-theoretic measure built on the f-divergence, and design a training objective in which each attention head learns an independent marginal transition kernel, capturing a distinct parent-child dependency. Contribution/Results: We establish the first global convergence guarantee for single-layer multi-head Transformers, with convergence in polynomial time. Crucially, when instantiated with the KL divergence, the learned attention scores provably recover the true DAG's adjacency matrix exactly. Empirical evaluation confirms strong alignment between the theoretical predictions and the observed structure-recovery accuracy.

๐Ÿ“ Abstract
Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
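The abstract states that, at convergence, the learned attention scores reflect the ground-truth adjacency matrix. As a hedged illustration of that readout step only (this is not the paper's code; the function name, the threshold, and the toy score matrices are all assumptions), one could recover a binary adjacency matrix from per-head attention maps with an argmax-and-threshold rule:

```python
import numpy as np

# Illustrative sketch: after training, each of the K attention heads is expected
# to concentrate on one parent per node. We read off a DAG adjacency matrix by
# thresholding per-head attention scores (toy numbers, not trained values).

def adjacency_from_attention(head_scores, threshold=0.5):
    """head_scores: list of (n, n) arrays; head_scores[h][i, j] is the attention
    node i pays to candidate parent j under head h. Returns a binary (n, n)
    matrix A with A[j, i] = 1 if j is inferred as a parent of i."""
    n = head_scores[0].shape[0]
    A = np.zeros((n, n), dtype=int)
    for S in head_scores:
        parents = S.argmax(axis=1)          # each head picks one parent per node
        for child, parent in enumerate(parents):
            if S[child, parent] >= threshold and parent != child:
                A[parent, child] = 1        # directed edge parent -> child
    return A

# Toy 3-node example with 2 heads; node 2 has parents {0, 1}.
h1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.9, 0.1, 0.0]])  # head 1: node 2 attends to parent 0
h2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.1, 0.9, 0.0]])  # head 2: node 2 attends to parent 1
print(adjacency_from_attention([h1, h2]))  # edges 0 -> 2 and 1 -> 2
```

Self-attention rows (a node attending to itself) are skipped here, since a DAG has no self-loops.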
Problem

Research questions and friction points this paper is trying to address.

Extending transformer guarantees from trees to DAGs
Learning multiple parent-child dependencies with attention heads
Provably recovering graph structures via kernel-guided mutual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel-guided mutual information metric for DAGs
Multi-head attention with distinct transition kernels
Gradient ascent training for global optimum convergence
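The second innovation, multi-head attention with distinct transition kernels, can be pictured as each head owning its own query/key weights and thus being free to specialize to a different parent-child dependency. A minimal numpy sketch under that reading (shapes, names, and the scaled-dot-product form are illustrative assumptions, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_head_attention_maps(X, Wq, Wk):
    """X: (n, d) token embeddings; Wq, Wk: (K, d, d) per-head weights.
    Returns K row-stochastic attention maps, shape (K, n, n)."""
    maps = []
    for q, k in zip(Wq, Wk):
        scores = (X @ q) @ (X @ k).T / np.sqrt(X.shape[1])
        maps.append(softmax(scores))
    return np.stack(maps)

n, d, K = 5, 8, 2                      # 5 tokens, dim 8, 2 heads (toy sizes)
X = rng.normal(size=(n, d))
Wq, Wk = rng.normal(size=(2, K, d, d)) / np.sqrt(d)
attn = per_head_attention_maps(X, Wq, Wk)
print(attn.shape)  # (2, 5, 5); each row of each head's map sums to 1
```

In the paper's framework each head is additionally tied to its own marginal transition kernel through the KG-MI objective; the sketch above shows only the architectural separation between heads.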
Yuan Cheng
ISEP and Department of Mathematics, National University of Singapore, Singapore
Yu Huang
Department of Statistics and Data Science, University of Pennsylvania, USA
Zhe Xiong
Independent Researcher
Yingbin Liang
Department of Electrical and Computer Engineering, The Ohio State University, USA
Vincent Y. F. Tan
Professor, Department of Mathematics, National University of Singapore
Information Theory · Machine Learning · Signal Processing