Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surprisal-based syntactic evaluation methods—Wilcox et al.'s direct minimal-pair contrast approach versus Lan et al.'s Difference-in-Differences (DiD) method—yield inconsistent diagnostic conclusions about LLMs' syntactic competence. Method: We systematically compare their transparency and accuracy, focusing on filler-gap dependencies and parasitic gap constructions. We propose an enhanced direct minimal-pair paradigm, designing eight fine-grained parasitic gap stimuli and integrating wh-island effect analysis to evaluate GPT-2. Contribution/Results: Our refined method substantially improves diagnostic transparency and shows that the choice of metric critically shapes assessments of syntactic competence. Crucially, GPT-2 consistently satisfies parasitic gap licensing constraints across all four tested conditions, including island-sensitive environments, demonstrating robust syntactic generalization beyond surface-level patterns. These findings underscore the importance of methodological rigor in probing linguistic knowledge in LLMs and provide evidence that GPT-2 encodes abstract, hierarchical syntactic principles.

📝 Abstract
Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.
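The two metrics contrasted in the abstract can be sketched in a few lines. This is a minimal illustration, not the papers' actual evaluation code: the surprisal values below are hypothetical placeholders, and in practice they would be negative log-probabilities of the critical region obtained from a model such as GPT-2. The sentence fragments in the comments are likewise invented examples of the four filler/gap conditions.

```python
# Sketch of two surprisal-based syntactic metrics.
# Surprisal values here are hypothetical; in a real experiment they would
# come from a language model (e.g. GPT-2) as the negative log-probability
# of the critical region given its prefix.

def wh_effect(s_no_filler: float, s_filler: float) -> float:
    """Direct minimal-pair contrast (Wilcox-style wh-effect):
    surprisal at the gap site without a wh-filler minus surprisal with one.
    A positive value suggests the filler licenses the gap."""
    return s_no_filler - s_filler

def did(s: dict) -> float:
    """Difference-in-Differences (one common formulation of the
    Lan et al.-style metric): the filler's effect on surprisal in the
    ungapped condition minus its effect in the gapped condition."""
    return (s[("+filler", "-gap")] - s[("-filler", "-gap")]) \
         - (s[("+filler", "+gap")] - s[("-filler", "+gap")])

# Hypothetical surprisal values (nats) at the critical region:
surprisals = {
    ("+filler", "+gap"): 3.1,  # e.g. "I know what the attorney forgot __ ..."
    ("-filler", "+gap"): 7.4,  # e.g. "*I know that the attorney forgot __ ..."
    ("+filler", "-gap"): 6.0,  # e.g. "*I know what the attorney forgot the case ..."
    ("-filler", "-gap"): 2.8,  # e.g. "I know that the attorney forgot the case ..."
}

# The filler should lower surprisal at a gap (positive wh-effect):
print(wh_effect(surprisals[("-filler", "+gap")], surprisals[("+filler", "+gap")]))
print(did(surprisals))
```

The diagnostic difference is visible even in this toy form: the wh-effect reads off a single minimal-pair contrast directly, whereas DiD aggregates two contrasts into one number, which is part of why the paper argues the direct approach is more transparent.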
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM syntactic competence via direct minimal pair analysis
Evaluating GPT-2's knowledge of filler-gap dependencies in parasitic gaps
Comparing diagnostic transparency of different syntactic evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct minimal pair analysis for syntactic assessment
Full permutation paradigm of parasitic gap stimuli
Wilcox-style wh-effect analysis for GPT-2 evaluation
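As an illustration of what a "full permutation paradigm" involves, the sketch below crosses three binary stimulus factors to enumerate 2³ = 8 conditions. The factor names (wh-filler, true gap, parasitic gap) are assumptions for illustration; the paper's actual stimulus design may label or factor its conditions differently.

```python
from itertools import product

# Hypothetical binary factors for a parasitic-gap stimulus paradigm.
FACTORS = ("filler", "true_gap", "parasitic_gap")

def paradigm():
    """Enumerate all 2**3 = 8 conditions of the factorial design."""
    for values in product(("+", "-"), repeat=len(FACTORS)):
        yield dict(zip(FACTORS, values))

conditions = list(paradigm())
print(len(conditions))  # 8
```

Each of the eight condition dictionaries would then be paired with a lexically matched sentence, so that surprisal contrasts between conditions isolate a single licensing factor.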