🤖 AI Summary
Problem: Existing SimCSE-style dropout-based augmentation induces excessive embedding uniformity in patent representation learning, undermining semantic cohesion and structural fidelity.
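The "excessive uniformity" diagnosis can be quantified with the commonly used alignment/uniformity metrics for embeddings on the unit hypersphere. The sketch below is our illustration of those metrics, not code from the paper; over-dispersed dropout-trained embeddings would show a very negative uniformity score while losing alignment between related sections.

```python
import numpy as np

def alignment(x, y):
    """Mean squared distance between L2-normalized embeddings of positive
    pairs (e.g. two sections of the same patent); lower = pairs stay close."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs; more
    negative = embeddings spread more uniformly over the hypersphere."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    n = x.shape[0]
    off_diag = sq_dists[~np.eye(n, dtype=bool)]  # drop self-distances
    return float(np.log(np.mean(np.exp(-t * off_diag))))
```

A tight cluster of embeddings yields a uniformity score near zero, while well-spread embeddings drive it strongly negative; the failure mode described above is the latter taken to an extreme, at the cost of alignment.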
Method: We propose a multi-view self-supervised contrastive learning framework explicitly leveraging patent section structure—treating the description, claims, and other sections as complementary views—and introduce section-aware augmentation strategies to replace random dropout, thereby modeling the intrinsic argumentative logic and task-specific structural properties of patents. This approach enables sentence encoders to learn paragraph-level semantic representations without human annotations.
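As a sketch of how the multi-view objective can be read (our assumption, not the paper's implementation), the loss is an in-batch InfoNCE where the two "views" of each patent are embeddings of two different sections, and the other patents in the batch act as negatives; the encoder and batching details are omitted here.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.05):
    """In-batch InfoNCE loss: anchor[i] and positive[i] are embeddings of two
    sections (e.g. claims vs. description) of the same patent; every other
    row in the batch serves as a negative for row i."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))         # pull matching pairs together
```

Under this reading, replacing dropout views with section views changes only where `anchor` and `positive` come from, not the loss itself, which is how the framework injects semantic diversity without labels.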
Contribution/Results: Evaluated on a large-scale patent benchmark, our unsupervised variant achieves performance on par with or superior to supervised models relying on citation links or IPC labels, and significantly outperforms state-of-the-art methods on patent retrieval and classification tasks.
📝 Abstract
This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification, underscoring the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.