AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

📅 2024-02-14
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks lack a systematic assessment of sequential reasoning—the ability to execute algorithmic procedures such as DFS, BFS, and binary search that demand dynamic decision-making and stateful memory. Method: We propose AQA-Bench, an interactive benchmark tailored for algorithmic tasks, built around an evaluation protocol in which the model's perception and policy must update in real time, with algorithm-logic-driven dynamic prompting and a multi-turn few-shot framework. Contributions/Results: Experiments across 12 open- and closed-source models reveal that: (1) GPT-4 and Gemini substantially outperform open-source models; (2) naively providing interactive few-shot examples can hurt performance, while supplying merely a few optimal prior steps significantly boosts small models—a counterintuitive “fewer examples, better performance” effect; (3) performance does not scale monotonically with model size, and on some tasks larger models perform worse. These findings challenge the prevalent “bigger is better” assumption and motivate algorithm-level evaluation of LLMs.

📝 Abstract
This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of our evaluation benchmark lies in its interactive evaluation protocol -- for example, in DFS, the availability of each node's connected edge is contingent upon the model's traversal to that node, thereby necessitating the LLM's ability to effectively remember visited nodes and strategize subsequent moves. We comprehensively build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search, and then evaluate the sequential reasoning ability of 12 different LLMs. Our investigations reveal several interesting findings: (1) Closed-source models like GPT-4 and Gemini generally show strong sequential reasoning ability, significantly outperforming open-source LLMs. (2) Naively providing interactive examples may inadvertently hurt few-shot performance. (3) A very limited number of predecessor steps following the optimal policy can substantially boost small models' performance. (4) The scaling correlation between performance and model size is not always significant, sometimes even showcasing an inverse trend. We hope our study can catalyze future work on advancing the understanding and enhancement of LLMs' capabilities in sequential reasoning. The code is available at https://github.com/UCSC-VLAA/AQA-Bench.
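The interactive protocol described in the abstract can be sketched as a tiny environment that reveals a node's neighbors only once the agent visits that node, forcing the agent to track visited state itself. This is a minimal illustration with assumed names (`InteractiveGraphEnv`, `dfs_agent`), not the benchmark's actual implementation, which lives in the linked repository:

```python
from collections import defaultdict

class InteractiveGraphEnv:
    """Hypothetical environment in the spirit of AQA-Bench's protocol:
    a node's adjacency list is revealed only when that node is visited."""

    def __init__(self, edges, start):
        self.adj = defaultdict(list)
        for u, v in edges:
            self.adj[u].append(v)
            self.adj[v].append(u)
        self.start = start
        self.trajectory = []  # visit order, for scoring

    def visit(self, node):
        # Interaction step: reveal neighbors only upon an actual visit.
        self.trajectory.append(node)
        return list(self.adj[node])

def dfs_agent(env):
    """Reference DFS policy; the agent, not the environment,
    must remember which nodes were already visited."""
    visited, stack = set(), [env.start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        neighbors = env.visit(node)
        stack.extend(n for n in neighbors if n not in visited)
    return env.trajectory

env = InteractiveGraphEnv([(0, 1), (0, 2), (1, 3)], start=0)
print(dfs_agent(env))  # → [0, 2, 1, 3]
```

In the benchmark, an LLM plays the role of `dfs_agent`: each `visit` corresponds to a dialogue turn, and the model's output sequence is compared against a valid traversal.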
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' sequential reasoning in algorithmic tasks
Evaluating interactive reasoning with dynamic node traversal
Comparing performance of 12 LLMs across three algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive evaluation protocol for sequential reasoning
Benchmark with three algorithmic contexts
Optimal predecessor steps boost performance