Formalising lexical and syntactic diversity for data sampling in French

📅 2025-01-14
🤖 AI Summary
Problem: Constructing high-diversity French text datasets is computationally expensive when it relies on optimal diversity sampling. Method: We propose a heuristic sampling approach that jointly quantifies lexical diversity via token frequency statistics and syntactic diversity via dependency tree complexity. Contribution/Results: Experiments across multiple French corpora show that the heuristic significantly increases diversity relative to random sampling. Lexical diversity can partially proxy syntactic diversity, but the strength of their correlation depends on both the corpus and the choice of metric, so an arbitrarily chosen diversity measure may miss diversity-related properties of a dataset. The proposed framework supports low-cost, reproducible construction of lexically and syntactically diverse French datasets.
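The correlation question in the summary can be illustrated with a small sketch. It assumes type-token ratio as the lexical measure and dependency tree depth as the syntactic measure; both are stand-ins chosen for illustration, since this page does not specify the paper's actual measures. The helper names (`type_token_ratio`, `tree_depth`, `spearman`) and the toy parses are hypothetical.

```python
import math


def type_token_ratio(tokens):
    """Lexical diversity stand-in: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tree_depth(heads):
    """Syntactic complexity stand-in: longest path to the root.

    `heads` holds 1-based head indices, with 0 marking the root.
    """
    def depth(i):
        d = 0
        while heads[i] != 0:
            i = heads[i] - 1
            d += 1
        return d
    return max(depth(i) for i in range(len(heads))) if heads else 0


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy parsed sentences (tokens, heads); the parses are hand-written assumptions.
parsed = [
    ("le chat dort".split(), [2, 3, 0]),
    ("le chat que le chien voit dort".split(), [2, 7, 6, 5, 6, 2, 0]),
    ("oui oui oui".split(), [0, 1, 1]),
]
lexical = [type_token_ratio(tokens) for tokens, _ in parsed]
syntactic = [tree_depth(heads) for _, heads in parsed]
rho = spearman(lexical, syntactic)
```

Swapping in a different lexical or syntactic measure changes the ranks and therefore `rho`, which is the kind of metric sensitivity the summary describes.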

📝 Abstract
Diversity is an important property of datasets, and sampling data for diversity is useful in dataset creation. Because finding the optimally diverse sample is expensive, we present a heuristic that significantly increases diversity relative to random sampling. We also explore whether two kinds of diversity, lexical and syntactic, correlate, with the aim of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate across datasets and across versions of diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
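To make the contrast between heuristic and random sampling concrete, here is a minimal sketch of one possible greedy heuristic. It is not the authors' algorithm, which this page does not describe: it assumes Shannon entropy over token counts as the diversity measure and adds, at each step, the sentence that raises that entropy the most. All function names and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a frequency table; higher means more diverse."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def sample_entropy(corpus, indices):
    """Entropy of the token distribution pooled over the chosen sentences."""
    return shannon_entropy(Counter(t for i in indices for t in corpus[i]))


def greedy_diverse_sample(corpus, k):
    """Pick k sentence indices, each time adding the one that maximises entropy.

    This costs O(k * n) entropy evaluations instead of scoring every subset.
    """
    chosen, counts = [], Counter()
    remaining = list(range(len(corpus)))
    for _ in range(min(k, len(corpus))):
        best = max(remaining, key=lambda i: shannon_entropy(counts + Counter(corpus[i])))
        chosen.append(best)
        counts.update(corpus[best])
        remaining.remove(best)
    return chosen


# Toy tokenised corpus (an assumption for illustration only).
corpus = [
    "le chat dort".split(),
    "le chat dort".split(),
    "le chat mange".split(),
    "un chien court vite".split(),
    "la pluie tombe".split(),
]
picked = greedy_diverse_sample(corpus, 2)
baseline = random.Random(0).sample(range(len(corpus)), 2)
print(sample_entropy(corpus, picked), sample_entropy(corpus, baseline))
```

A greedy pass like this is not guaranteed to find the optimal sample, but it avoids enumerating all subsets, which is the cost the abstract refers to.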

Problem

Research questions and friction points this paper is trying to address.

French Linguistic Diversity
Lexical and Syntactic Analysis
Diverse Dataset Creation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Intelligent Method
Linguistic Diversity
Syntactic Variability
Louis Estève
Université Paris-Saclay, CNRS, LISN, Orsay, France
Manon Scholivet
Université Paris-Saclay, CNRS, LISN, Orsay, France
Agata Savary
Université Paris-Saclay, France