Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties

📅 2025-11-05
🤖 AI Summary
To address the generalization bottleneck of small language models (SLMs) in SHACL-guided RDF graph extraction, which is caused by the long-tailed distribution of datatype and object properties, this paper studies fine-tuning strategies aimed at balanced learning of rare properties. It evaluates stratified sampling, a weighted loss function, dataset scaling, and template-based synthetic data augmentation as ways to mitigate class imbalance. The strategy that performs most evenly across unbalanced target properties is to build a training set in which every property occurs more often than a given threshold. The authors release their datasets, code, and experimental results, providing a reproducible methodology and practical guidance for training shape-aware SLMs.

📝 Abstract
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy for performing equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we have publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
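
The threshold-based training set described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the greedy selection rule, the `(text, properties)` example format, the function name `select_with_min_support`, and the `ex:` property names are all assumptions made for illustration. The sketch keeps an example only while at least one of its properties is still below the threshold, and reports the properties that could not reach it.

```python
from collections import Counter


def select_with_min_support(examples, min_count):
    """Greedily pick examples until every property occurs at least
    `min_count` times, or the pool is exhausted.

    `examples` is a list of (text, properties) pairs; this format is a
    hypothetical stand-in for the real training data.
    """
    counts = Counter()
    selected = []
    all_props = {p for _, props in examples for p in props}

    for text, props in examples:
        props = set(props)
        # Keep the example only if it covers a property still under the threshold.
        if any(counts[p] < min_count for p in props):
            selected.append((text, props))
            counts.update(props)

    # Properties that never reached the threshold (candidates for augmentation).
    under = sorted(p for p in all_props if counts[p] < min_count)
    return selected, counts, under


# Illustrative toy pool: a frequent property and a rare one.
pool = [
    ("a", ["ex:birthDate"]),
    ("b", ["ex:birthDate"]),
    ("c", ["ex:birthDate"]),
    ("d", ["ex:spouse"]),
    ("e", ["ex:birthDate", "ex:spouse"]),
]
selected, counts, under = select_with_min_support(pool, min_count=2)
```

Properties returned in `under` are those for which the pool is too small, which is where dataset scaling or synthetic augmentation would come in.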
Problem

Research questions and friction points this paper is trying to address.

SLMs struggle with rare properties in relation extraction
Long-tail property distributions hinder RDF graph extraction
Generalization needs to improve for both datatype and object properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stratified sampling addresses rare property distribution
Weighted loss optimizes model for unbalanced properties
Template-based synthetic data augments training sets
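
Two of the ideas listed above, a weighted loss and template-based synthetic data, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the inverse-frequency weighting formula is a common balancing heuristic chosen here as an assumption, and the function names, the `TEMPLATES` dictionary, the `ex:` property names, and the sample sentence are hypothetical.

```python
def inverse_frequency_weights(property_counts):
    """Weight each property by total / (num_properties * count),
    so that rare properties receive larger weights (assumed heuristic)."""
    total = sum(property_counts.values())
    k = len(property_counts)
    return {p: total / (k * n) for p, n in property_counts.items()}


def weighted_mean_loss(losses, labels, weights):
    """Average per-example losses, each scaled by the weight of its property."""
    numerator = sum(weights[label] * loss for loss, label in zip(losses, labels))
    denominator = sum(weights[label] for label in labels)
    return numerator / denominator


# Hypothetical sentence templates, one per rare property.
TEMPLATES = {
    "ex:spouse": "{subject} is married to {object}.",
}


def synthesize(prop, pairs):
    """Fill the template of `prop` with (subject, object) pairs and
    return (sentence, triple) training examples."""
    template = TEMPLATES[prop]
    return [
        (template.format(subject=s, object=o), (s, prop, o))
        for s, o in pairs
    ]


# Illustrative counts: one frequent and one rare property.
weights = inverse_frequency_weights({"ex:birthDate": 90, "ex:spouse": 10})
loss = weighted_mean_loss([1.0, 3.0], ["ex:birthDate", "ex:spouse"], weights)
synthetic = synthesize("ex:spouse", [("Marie Curie", "Pierre Curie")])
```

With these toy counts the rare property gets a weight nine times larger than the frequent one, so errors on it dominate the averaged loss; the template function shows how additional examples for that same property could be generated instead.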
Célian Ringwald
Univ. Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France
Fabien L. Gandon
Univ. Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France
Catherine Faron
Professor, Univ. Côte d'Azur
Semantic Web, Knowledge Representation and Reasoning, Ontologies, Artificial Intelligence
Franck Michel
Université Côte d'Azur, CNRS, Inria, I3S
Knowledge Graphs, Semantic Web, Linked Data, Open Data, Biodiversity
Hanna Abi Akl
Univ. Côte d’Azur, Inria, CNRS, I3S, Sophia-Antipolis, France