🤖 AI Summary
This study investigates whether language models' (LMs) string probabilities reflect implicit grammatical knowledge, and what those probabilities can and cannot reveal about grammaticality. We propose a theoretically grounded framework linking grammaticality, semantic plausibility, and the probability distribution over a generated corpus, yielding three empirically testable predictions about LM probabilities on minimal syntactic pairs. Methodologically, we combine formal modeling, comparative probability analysis on minimal pairs, correlation analysis between LM scores and human grammaticality judgments, and an evaluation of linear separability for grammaticality classification. Empirical evaluation on 280,000 sentence pairs in English and Chinese shows that LM probability deltas within minimal pairs correlate strongly with human grammaticality judgments, and that the probabilities of paired strings are themselves correlated; however, grammatical and ungrammatical sentences are not linearly separable in probability space. Our work provides a theoretical foundation and an evaluation paradigm for probing LMs' implicit syntactic knowledge.
📝 Abstract
What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probabilities of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.
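To make the minimal-pair comparison concrete, here is a minimal sketch of how a within-pair probability delta can be computed. It is not the paper's method: the toy corpus, the add-one-smoothed bigram model, and the `logprob` helper are all stand-ins for a real LM and real data, chosen only so the example is self-contained.

```python
import math
from collections import Counter

# Toy corpus standing in for real training data (hypothetical).
corpus = [
    "the dogs bark loudly",
    "the dog barks loudly",
    "the cats sleep quietly",
    "the cat sleeps quietly",
]

# Collect unigram and bigram counts, with <s>/</s> boundary markers.
bigrams, unigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    vocab.update(toks)
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks, toks[1:]))

def logprob(sentence: str) -> float:
    """Add-one-smoothed bigram log probability of a sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab)))
        for a, b in zip(toks, toks[1:])
    )

# A minimal pair: the two strings differ only in subject-verb agreement,
# so shared bigrams cancel and the delta isolates the agreement contrast.
good = "the dog barks loudly"
bad = "the dog bark loudly"
delta = logprob(good) - logprob(bad)
print(f"delta = {delta:.3f}")  # positive: the grammatical string is more probable
```

A real evaluation would swap `logprob` for a causal LM's summed token log-likelihoods; these model deltas are then the quantity correlated with human judgment deltas in prediction (2).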