🤖 AI Summary
This work investigates whether large language models (LLMs) can interpret questions compositionally in the QALD setting—i.e., whether they can systematically parse structurally complex questions into correct SPARQL queries, given that they have seen the atomic building blocks. Method: The authors introduce CompoST, a benchmark for evaluating compositional question interpretation, consisting of controlled datasets derived from DBpedia graph patterns at three levels of structural complexity and verbalized using Lemon lexica; models of different sizes are evaluated with prompt and few-shot optimization as well as fine-tuning. Results: Macro F1 degrades sharply as questions deviate structurally from the samples optimized on (0.45 → 0.26 → 0.09), and even on the lowest-complexity dataset, with all necessary information provided in the input, F1 never exceeds 0.57. The study thus characterizes and quantifies a compositional-generalization bottleneck of LLMs in KBQA.
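The compositional setup described above—complex queries built from atomic parts the model has already seen—can be sketched as follows. This is a hypothetical illustration, not the paper's actual pipeline; the atom names and the specific DBpedia properties (`dbo:birthPlace`, `dbo:country`) are chosen for the example.

```python
# Hypothetical sketch: atomic DBpedia triple patterns (the "building blocks")
# and a helper that composes them into a structurally more complex query.
ATOMS = {
    "birth_place": "?person dbo:birthPlace ?city .",
    "country": "?city dbo:country ?country .",
}

def compose(atom_names, select_var="?country"):
    """Join atomic graph patterns into one SPARQL SELECT query."""
    body = "\n  ".join(ATOMS[name] for name in atom_names)
    return f"SELECT {select_var} WHERE {{\n  {body}\n}}"

# An atomic question ("Where was X born?") uses one pattern; a complex
# question ("In which country was X born?") composes two of them.
complex_query = compose(["birth_place", "country"])
```

The benchmark's premise is that a systematic interpreter which handles each atom in isolation should also handle their composition; the reported results indicate that current LLMs often do not.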
📝 Abstract
Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. To address this question, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. We conduct experiments with models of different sizes, using various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ via $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
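The macro $F_1$ metric reported above can be sketched as follows, assuming a QALD-style protocol (an assumption about the exact evaluation setup): $F_1$ is computed per question over the gold vs. predicted answer sets of the executed SPARQL queries, then averaged across questions.

```python
# Sketch of macro-averaged F1 over answer sets, assuming QALD-style
# per-question evaluation (not necessarily the paper's exact protocol).
def answer_f1(gold: set, pred: set) -> float:
    """F1 between one question's gold and predicted answer sets."""
    if not gold and not pred:
        return 1.0  # both empty: treated as a perfect match
    tp = len(gold & pred)  # true positives: answers in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(golds: list, preds: list) -> float:
    """Average per-question F1 across the whole dataset."""
    scores = [answer_f1(g, p) for g, p in zip(golds, preds)]
    return sum(scores) / len(scores)
```

Under this scheme, a query that is syntactically valid but structurally wrong typically returns a disjoint answer set and scores 0 for that question, which is why macro $F_1$ falls off steeply as compositional errors accumulate.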