AI Summary
Semantic matching of text requires jointly modeling hierarchical syntactic structure and fine-grained semantic distinctions, yet prevailing pretrained language models struggle to capture structured, cross-sentence interactions. To address this, we propose a context-aware dual-graph encoding framework: (1) constructing dual semantic graphs by integrating dependency parsing and topic modeling; (2) propagating structural features via Graph Isomorphism Networks (GIN); and (3) introducing a joint node-level and graph-level contrastive learning objective, enhanced by explicit and implicit negative sampling to refine the representation space. This work is the first to synergistically integrate structure-aware graph encoding with hierarchical contrastive learning for semantic matching. Evaluated on three legal document matching benchmarks and an academic plagiarism detection dataset, our method achieves state-of-the-art performance; for example, it reaches 86.7% F1 on legal statute matching, an absolute improvement of 6.2 percentage points.
Abstract
Text semantic matching requires nuanced understanding of both structural relationships and fine-grained semantic distinctions. While pre-trained language models excel at capturing token-level interactions, they often overlook hierarchical structural patterns and struggle with subtle semantic discrimination. In this paper, we propose StructCoh, a graph-enhanced contrastive learning framework that synergistically combines structural reasoning with representation space optimization. Our approach features two key innovations: (1) A dual-graph encoder constructs semantic graphs via dependency parsing and topic modeling, then employs graph isomorphism networks to propagate structural features across syntactic dependencies and cross-document concept nodes. (2) A hierarchical contrastive objective enforces consistency at multiple granularities: node-level contrastive regularization preserves core semantic units, while graph-aware contrastive learning aligns inter-document structural semantics through both explicit and implicit negative sampling strategies. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements over state-of-the-art methods. Notably, StructCoh achieves 86.7% F1-score (+6.2% absolute gain) on legal statute matching by effectively identifying argument structure similarities.
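The two ingredients named in the abstract, GIN-style feature propagation and a contrastive objective, can be illustrated with a minimal, dependency-free sketch. This is not the StructCoh implementation. The function names (`gin_layer`, `readout`, `info_nce`), the toy graphs, the sum-pooling readout, and the use of an InfoNCE-style loss are all assumptions made for illustration. The learned MLP of a real GIN layer is replaced here by a plain ReLU, and the negative is given explicitly rather than sampled.

```python
import math

def gin_layer(features, adjacency, eps=0.0):
    """One GIN-style update: h_v' = ReLU((1 + eps) * h_v + sum of neighbor features).

    Assumption: the learned MLP of a real GIN layer is replaced by a ReLU.
    """
    updated = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in adjacency[v]:
            agg = [a + b for a, b in zip(agg, features[u])]
        updated.append([max(0.0, a) for a in agg])
    return updated

def readout(features):
    """Sum-pool node vectors into one graph-level vector (assumed readout)."""
    dim = len(features[0])
    return [sum(h[d] for h in features) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denom - logits[0]

# Hypothetical toy graphs: (node feature vectors, adjacency lists).
doc_a = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1], [0, 2], [1]])
doc_b = ([[1.0, 0.1], [0.0, 1.0], [0.9, 1.0]], [[1], [0, 2], [1]])  # similar to doc_a
doc_c = ([[3.0, 0.0], [2.0, 0.0]], [[1], [0]])                      # dissimilar

g_a = readout(gin_layer(*doc_a))
g_b = readout(gin_layer(*doc_b))
g_c = readout(gin_layer(*doc_c))

loss_match = info_nce(g_a, g_b, [g_c])     # positive is the similar graph
loss_mismatch = info_nce(g_a, g_c, [g_b])  # positive is the dissimilar graph
print(loss_match < loss_mismatch)
```

In this toy setting the loss is lower when the structurally similar graph is treated as the positive, which is the behavior a graph-level contrastive objective is meant to encourage. A node-level variant would apply the same loss to individual rows of the `gin_layer` output instead of the pooled vectors.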