🤖 AI Summary
Existing CLIP-style graph–text alignment methods suffer from two key limitations: (1) they enforce a rigid one-to-one mapping assumption, ignoring the intrinsic many-to-many semantic relationships in graphs; and (2) they rely on static alignment objectives, making them brittle under noisy supervision. This paper proposes ADAligner, the first dynamic, supervision-aware graph–text alignment framework. By estimating batch-level alignment reliability, ADAligner adaptively switches between subgraph-level many-to-many alignment (when supervision is clean) and node-level one-to-one alignment with dynamic filtering of low-confidence pairs (under noise). It introduces a soft subgraph alignment loss and a tunable optimization objective, with theoretical guarantees establishing the adaptation as a stable negative-feedback process. Evaluated on nine text-attributed graph datasets, ADAligner achieves significant gains in zero-/few-shot classification, link prediction, and cross-modal retrieval, accelerates pre-training by roughly 2–3×, and remains strongly robust to label noise.
📝 Abstract
Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. However, existing CLIP-style graph-text aligners face two key limitations: they assume strict one-to-one correspondences between nodes and texts, overlooking the inherent many-to-many relations in real-world graphs; and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. Together, these limitations expose a core dilemma: embracing expressive many-to-many alignment amplifies noise, while reverting to strict one-to-one strategies sacrifices semantic diversity and fails to handle inherently mismatched pairs. To address these challenges, we propose ADAligner, a dynamic, quality-aware graph-text alignment framework that adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality. ADAligner estimates batch-level alignment reliability in real time and adapts its optimization accordingly: it promotes soft, subgraph-level many-to-many alignment when supervision is clean, and emphasizes reliable one-to-one alignment by dynamically filtering low-confidence pairs under noise. Theoretically, we prove that this dynamic mechanism forms a stable negative-feedback process, ensuring convergence and robustness. Comprehensive experiments on nine diverse TAG datasets demonstrate that ADAligner consistently outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction, and cross-modal retrieval tasks. It maintains strong robustness under noisy supervision and accelerates pre-training by approximately 2 to 3 times compared to multimodal baselines, establishing a scalable and reliable foundation for graph-text representation learning in real-world web environments.
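To make the quality-aware switching concrete, the sketch below shows one plausible way such a mechanism could look in PyTorch: a batch-level reliability proxy gates between a soft, subgraph-level many-to-many loss and a filtered one-to-one loss. This is a minimal illustration under our own assumptions; the function name, the reliability proxy, the gating thresholds, and the soft-target construction are hypothetical and are not taken from the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def reliability_gated_alignment_loss(node_emb, text_emb, soft_targets,
                                     temperature=0.07,
                                     reliability_floor=0.5,
                                     keep_ratio=0.7):
    """Hypothetical sketch of a reliability-gated graph-text alignment objective.

    node_emb, text_emb: (B, d) embeddings for a batch of node/text pairs.
    soft_targets: (B, B) row-normalized similarity matrix encoding
        many-to-many (subgraph-level) relations within the batch.
    All names and the gating rule are illustrative, not ADAligner's exact loss.
    """
    node_emb = F.normalize(node_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = node_emb @ text_emb.t() / temperature              # (B, B)

    # One-to-one (node-level) objective: diagonal pairs are positives.
    hard_targets = torch.arange(logits.size(0), device=logits.device)
    per_pair_ce = F.cross_entropy(logits, hard_targets, reduction="none")

    # Many-to-many (soft subgraph-level) objective: cross-entropy against
    # a soft target distribution over the batch.
    log_probs = F.log_softmax(logits, dim=-1)
    soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean()

    # Batch-level reliability proxy: fraction of pairs whose diagonal entry
    # is the row-wise argmax (a crude estimate of supervision quality).
    reliability = (logits.argmax(dim=-1) == hard_targets).float().mean()

    if reliability >= reliability_floor:
        # Clean supervision: favor the expressive many-to-many objective.
        loss = reliability * soft_loss + (1 - reliability) * per_pair_ce.mean()
    else:
        # Noisy supervision: keep only the most confident one-to-one pairs.
        k = max(1, int(keep_ratio * per_pair_ce.numel()))
        kept, _ = torch.topk(per_pair_ce, k, largest=False)
        loss = kept.mean()
    return loss, reliability
```

Because the reliability estimate both raises the weight of the many-to-many term when alignment is easy and shrinks the trusted pair set when it is not, the gate acts as the kind of negative-feedback loop the abstract describes, though the actual convergence guarantees rest on the paper's formulation rather than this sketch.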