🤖 AI Summary
This study systematically investigates the impact of natural versus synthetic data on syntactic generalization in large language models, focusing on the passive alternation phenomenon in French and Italian. Leveraging the Blackbird Language Matrices (BLM) framework, the authors fine-tune and evaluate models using real-world corpora from Universal Dependencies alongside synthetically generated data derived from structured templates. Results demonstrate that models trained exclusively on synthetic data achieve high performance on in-distribution evaluations but fail to generalize effectively to natural language contexts. In contrast, models trained on natural data exhibit robust and significantly superior performance across both synthetic and natural test sets. This work provides cross-lingual evidence that natural data are indispensable for acquiring abstract grammatical regularities, particularly in the domain of passive voice constructions.
📝 Abstract
This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies against structured templates instantiated with synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured setups in linguistic evaluation for probing LLMs' syntactic and semantic knowledge.