SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval

📅 2025-08-28
🤖 AI Summary
Existing contrastive learning methods for long structured document retrieval neglect document structure and element-level semantics. To address this, we propose a structure-aware contrastive learning framework: (1) a structural encoding mechanism that explicitly models hierarchical document structure, and (2) a masked element alignment strategy enabling fine-grained semantic matching. We introduce StructDoc—the first large-scale, human-annotated dataset with rich structural labels—filling a critical gap in high-quality, structure-labeled benchmarks. Extensive experiments on state-of-the-art pretrained language models (e.g., BGE-M3) demonstrate consistent gains, improving NDCG@10 by 3.88 percentage points (73.96% → 77.84%). Industrial-scale online A/B testing further validates real-world effectiveness. Our core contributions are: (i) a novel structure-aware contrastive learning paradigm; (ii) an element-level semantic alignment mechanism; and (iii) the open-sourced StructDoc benchmark—a foundational resource for structured document understanding and retrieval.
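For readers unfamiliar with the reported metric, here is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes graded relevance labels and the linear-gain form of DCG; the paper's exact evaluation script and gain function are not given here, and the relevance labels below are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one query, listed in the order the system ranked them.
ranked_labels = [3, 2, 0, 1, 0]
score = ndcg_at_k(ranked_labels)
```

A benchmark figure such as 77.84% would then be the mean of this per-query score over all evaluation queries, expressed as a percentage.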

📝 Abstract
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release StructDoc, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
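As background for the contrastive fine-tuning mentioned in the abstract, the sketch below shows a generic InfoNCE loss with in-batch negatives. It is not the paper's objective: the function names, the temperature value, and the toy embeddings are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss: docs[i] is the positive for queries[i];
    every other document in the batch serves as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in docs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy embeddings (hypothetical): each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query is closest to its own document and grows when the positives are mismatched, which is the signal that fine-tuning exploits.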
Problem

Research questions and friction points this paper is trying to address.

Improving retrieval of long structured documents when training datasets lack explicit structural information
Leveraging structural features and element-level semantics effectively
Addressing the absence of datasets with rich structural annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-aware learning preserves semantic hierarchies
Masked element alignment enables fine-grained discrimination
Releases StructDoc, a dataset with rich structural annotations
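To make the idea of element-level masking concrete, the following sketch builds a full view and a masked view of a structured document. This page does not describe the paper's actual masking or alignment procedure, so everything here is an illustrative assumption: the tag-based serialization, the `[MASK]` token, the masking probability, and the sample document are all hypothetical.

```python
import random

# Hypothetical structured document: a list of (element_type, text) pairs.
doc = [
    ("title", "Refund policy"),
    ("heading", "Eligibility"),
    ("paragraph", "Orders can be refunded within 30 days."),
    ("heading", "Process"),
    ("paragraph", "Submit a request through the order page."),
]

def serialize(elements):
    """Wrap each element in a tag so the encoder input keeps structural markers."""
    return " ".join(f"<{kind}>{text}</{kind}>" for kind, text in elements)

def mask_elements(elements, mask_prob=0.3, mask_token="[MASK]", rng=None):
    """Replace the text of randomly chosen elements with a mask token, keeping their tags."""
    rng = rng or random.Random()
    return [
        (kind, mask_token if rng.random() < mask_prob else text)
        for kind, text in elements
    ]

full_view = serialize(doc)
masked_view = serialize(mask_elements(doc, rng=random.Random(0)))
```

In a setup like this, an alignment objective could pull the embeddings of `full_view` and `masked_view` together so that the encoder learns which elements carry the document's meaning; whether SEAL does exactly this is not stated here.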
Xinhao Huang
HKUST (Guangzhou), Guangzhou, China
Zhibo Ren
Alibaba Group, Hangzhou, China
Yipeng Yu
Alibaba Group, Hangzhou, China
Ying Zhou
Zhejiang Lab, Hangzhou, China
Zulong Chen
Director, Alibaba Group
Machine Learning, Large Language Model, Search & Recommendation, NLP
Zeyi Wen
Assistant Professor at HKUST (Guangzhou)
Efficient LLMs, MLSys, HPO, HPC