🤖 AI Summary
Large language models (LLMs) often fail to reliably surface their implicit syntactic knowledge in sentence acceptability judgment tasks, so the raw sentence probabilities they assign can understate what they know. Method: The authors compare nine judgment methods and identify two complementary ones, a probability readout method (*in-template LP*) and a prompt-based method (*Yes/No probability computing*), which combine principled template design with prompt-based elicitation. Contribution/Results: Both methods outperform conventional raw-probability comparison on English and Chinese minimal-pair benchmarks, and because they excel on different linguistic phenomena, an ensemble of the two outperforms either method alone. The approach offers a more robust, interpretable, and linguistically grounded paradigm for probing and evaluating LLMs' grammatical competence.
📝 Abstract
The grammatical knowledge of language models (LMs) is often measured with benchmarks of linguistic minimal pairs, in which an LM is presented with a pair of acceptable and unacceptable sentences and asked to judge which is more acceptable. Conventional approaches directly compare the sentence probabilities assigned by LMs, but recent large language models (LLMs) are trained to perform tasks via prompting, and thus the raw probabilities they assign may not fully reflect their grammatical knowledge. In this study, we attempt to derive more accurate acceptability judgments from LLMs using prompts and templates. Through extensive experiments in English and Chinese, we compare nine judgment methods and find that two of them, a probability readout method (in-template LP) and a prompt-based method (Yes/No probability computing), achieve higher accuracy than the conventional ones. Our analysis reveals that these methods excel in different linguistic phenomena, suggesting they access different aspects of LLMs' knowledge. We also find that ensembling the two methods outperforms either method alone. Consequently, we recommend these techniques, individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.
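The three judgment strategies described above can be sketched as small decision rules over a model's scores. The sketch below is illustrative only: the template wording, the template-score subtraction in `in_template_lp`, and the toy scorers are assumptions for the sake of a self-contained example, not the paper's exact formulation; a real setup would derive `logprob` and `p_yes` from a causal LLM's token log-probabilities.

```python
def raw_lp(logprob, good: str, bad: str) -> bool:
    """Conventional method: compare raw sentence log-probabilities."""
    return logprob(good) > logprob(bad)

def in_template_lp(logprob, good: str, bad: str,
                   template: str = "Here is a sentence: {s}") -> bool:
    """In-template LP (sketch): score each sentence embedded in a template,
    subtracting the template-only score so only the sentence span counts.
    The template text and the subtraction step are assumptions here."""
    def score(s: str) -> float:
        return logprob(template.format(s=s)) - logprob(template.format(s=""))
    return score(good) > score(bad)

def yes_no_prob(p_yes, good: str, bad: str) -> bool:
    """Yes/No probability computing (sketch): prompt the model with a
    question like 'Is this sentence acceptable?' and compare P("Yes")."""
    return p_yes(good) > p_yes(bad)

# Toy stand-ins for a real LLM, used only to make the sketch runnable:
# they assign higher scores to texts containing the well-formed bigram.
def toy_lp(text: str) -> float:
    return -float(len(text)) + (5.0 if "The cat" in text else 0.0)

def toy_p_yes(text: str) -> float:
    return 0.9 if "The cat" in text else 0.2

good, bad = "The cat sleeps.", "Cat the sleeps."
print(raw_lp(toy_lp, good, bad))          # True under the toy scorer
print(in_template_lp(toy_lp, good, bad))  # True under the toy scorer
print(yes_no_prob(toy_p_yes, good, bad))  # True under the toy scorer
```

An ensemble in the spirit of the paper could be as simple as a majority vote over the two preferred methods' boolean verdicts, though the paper's actual ensembling scheme is not specified in the abstract.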