GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models

📅 2024-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite strong performance on standard benchmarks, large language models (LLMs) exhibit significant deficits in fundamental cognitive tasks, particularly compositional and conditional reasoning. Method: The authors introduce the first multiple-choice reasoning benchmark grounded in realistic flight-booking scenarios, explicitly coupling compositional and conditional reasoning within lexically diverse, real-world tasks. The benchmark is constructed from real flight data to generate semantically rich test instances with interwoven constraints, and models are evaluated under zero-shot and few-shot prompting strategies. Contribution/Results: Experiments reveal that even the state-of-the-art GPT-4 Turbo achieves only 67% accuracy, substantially below human performance, exposing inherent limitations of mainstream LLMs in structured conditional reasoning. This work is the first to jointly evaluate these two core reasoning capabilities within a realistic application domain, establishing a new dimension for LLM assessment.
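
To make the task concrete, here is a minimal sketch of what a GroundCocoa-style test instance could look like: a set of conditional user constraints over flight attributes, paired with candidate options of which exactly one satisfies every constraint. All field names, constraints, and values below are illustrative assumptions, not the paper's actual schema; the benchmark itself derives instances from real flight data.

```python
# Illustrative sketch of a flight-booking test instance: conditional user
# constraints composed over flight attributes, with one valid option.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Flight:
    airline: str
    stops: int
    price: float
    departure_hour: int  # 0-23, local time
    refundable: bool

# Conditional constraints: each is a predicate a valid flight must satisfy.
constraints: list[Callable[[Flight], bool]] = [
    # "If the flight has a layover, it must be refundable."
    lambda f: f.refundable if f.stops > 0 else True,
    # "Depart after 8am unless the fare is under $300."
    lambda f: f.departure_hour >= 8 or f.price < 300,
    # "No more than one stop."
    lambda f: f.stops <= 1,
]

options = {
    "A": Flight("SkyJet", stops=1, price=450.0, departure_hour=10, refundable=True),
    "B": Flight("SkyJet", stops=1, price=450.0, departure_hour=10, refundable=False),
    "C": Flight("AeroLine", stops=2, price=250.0, departure_hour=6, refundable=True),
    "D": Flight("AeroLine", stops=0, price=350.0, departure_hour=7, refundable=False),
}

# The gold answer is the unique option meeting all (composed) constraints.
gold = [k for k, f in options.items() if all(c(f) for c in constraints)]
assert gold == ["A"]
```

Answering correctly requires composing several constraints at once, some of them conditional on other attributes, which is exactly the combination the benchmark targets.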

📝 Abstract
The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their reasoning to address complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances, underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two aspects that are central to human cognition, and introduce GroundCocoa, a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs, with even the best-performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
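
As a companion to the instance sketch above, the multiple-choice evaluation described in the abstract could be run zero-shot along these lines. The `complete` callable stands in for any LLM API, and the prompt template and `extract_choice` parser are assumptions for illustration, not the paper's actual evaluation setup.

```python
# Sketch of zero-shot multiple-choice evaluation over instances like the
# one above. `complete` is any prompt -> text LLM call.
import re

def build_prompt(requirements: str, options: dict[str, str]) -> str:
    lines = [
        "A user has the following flight requirements:",
        requirements,
        "",
        "Which option satisfies ALL of the requirements?",
    ]
    lines += [f"({k}) {desc}" for k, desc in sorted(options.items())]
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)

def extract_choice(response: str) -> str | None:
    # Take the first standalone option letter in the model's response.
    m = re.search(r"\b([A-D])\b", response)
    return m.group(1) if m else None

def accuracy(items: list[dict], complete: "Callable[[str], str]") -> float:
    correct = 0
    for item in items:
        prompt = build_prompt(item["requirements"], item["options"])
        correct += extract_choice(complete(prompt)) == item["gold"]
    return correct / len(items)
```

Scoring by exact option letter keeps the metric simple; per the abstract, the reported 67% ceiling for GPT-4 Turbo holds even under more advanced prompting strategies.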
Problem

Research questions and friction points this paper is trying to address.

Evaluating compositional reasoning in LLMs
Assessing conditional reasoning in LLMs
Benchmarking LLMs on real-world flight booking tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GroundCocoa benchmark
Focuses on compositional and conditional reasoning
Evaluates LLMs in flight booking tasks