Patent Representation Learning via Self-supervision

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing SimCSE-style dropout-based data augmentation methods induce excessive embedding uniformity in patent representation learning, undermining semantic cohesion and structural fidelity. Method: We propose a multi-view self-supervised contrastive learning framework explicitly leveraging patent section structure—treating the description, claims, and other sections as complementary views—and introduce section-aware augmentation strategies to replace random dropout, thereby modeling the intrinsic argumentative logic and task-specific structural properties of patents. This approach enables sentence encoders to learn paragraph-level semantic representations without human annotations. Contribution/Results: Evaluated on a large-scale patent benchmark, our unsupervised variant achieves performance on par with or superior to supervised models relying on citation links or IPC labels, and significantly outperforms state-of-the-art methods on patent retrieval and classification tasks.

📝 Abstract
This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification, underscoring the role of patents' inherent discourse structure in representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
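The core objective sketched in the abstract, treating two sections of the same patent as positive views and other patents in the batch as negatives, is a standard in-batch InfoNCE loss. The sketch below is illustrative only, not the authors' implementation: the encoder producing the section embeddings, the batch construction, and the temperature value are all assumptions.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.05):
    """In-batch InfoNCE loss over paired section embeddings.

    view_a[i] and view_b[i] are embeddings of two sections of the
    same patent (e.g., claims and abstract); the other rows in the
    batch serve as negatives. Both inputs have shape (N, D).
    """
    # L2-normalize so dot products are cosine similarities
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive (same-patent) pairs
    return -np.mean(np.diag(log_probs))
```

When the two views of each patent agree and differ from all other patents in the batch, the loss approaches zero; random, unrelated views yield a loss near log N. The paper's section-based augmentation would supply `view_a` and `view_b` from different sections of the same document rather than from dropout-perturbed copies of one text.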
Problem

Research questions and friction points this paper is trying to address.

Overly uniform patent embeddings lose semantic cohesion in contrastive learning
Existing methods fail to leverage patents' inherent structural diversity across sections
Current approaches rely on brittle external annotations rather than self-supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning framework with multiple document views
Section-based augmentation replaces dropout for patents
Self-supervised method matches supervised baseline performance
You Zuo
Inria, Paris, France
Kim Gerdes
Université Paris-Saclay (LISN, CNRS), Orsay, France
Éric de la Clergerie
Inria, Paris, France
Benoît Sagot
Directeur de recherches at Inria, head of the ALMAnaCH team
NLP · Language Modelling · Low-resource Languages · Machine Translation · Computational Linguistics