🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models (LLMs) on data-flow reasoning—the procedural understanding of how data moves, transforms, and persists. To this end, we introduce FABLE, the first diagnostic benchmark for this capability. FABLE spans three domains—cooking, travel, and automation planning—and comprises 2,400 structured question-answer instances derived from eight canonical data-flow analyses drawn from software engineering, marking the first systematic transfer of program-analysis data-flow concepts to procedural natural-language text. We propose a scalable, cross-domain evaluation framework that integrates rule-driven modeling, multi-domain structured annotation, and a five-sample majority-voting protocol. Experiments on 8B-parameter models—DeepSeek-R1, LLaMA 3.1, and Granite Code—reveal that the specialized reasoning model achieves significantly higher accuracy but incurs more than 20× latency overhead, while the general-purpose and code-specialized models perform near chance, exposing fundamental deficiencies in state-evolution modeling and dependency tracking.
📝 Abstract
Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Although large language models (LLMs) are fluent in both natural and programming languages and are increasingly applied to decision-making in procedural tasks, they have not been systematically evaluated for their ability to perform data-flow reasoning. We introduce FABLE, an extensible benchmark designed to assess LLMs' understanding of data flow in structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark comprises 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but with inference more than 20 times slower than the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.
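The five-sample majority-voting protocol can be sketched as follows. This is a minimal illustration, not the paper's released code; the function name `majority_vote` and the assumption that model answers are already normalized to comparable strings are ours.

```python
from collections import Counter

def majority_vote(completions):
    """Return the most frequent answer among sampled completions.

    Assumes answers have already been normalized (e.g. lowercased,
    whitespace-stripped) so that equivalent answers compare equal.
    Ties are broken by first occurrence, per Counter.most_common.
    """
    counts = Counter(completions)
    answer, _ = counts.most_common(1)[0]
    return answer

# Example: five sampled answers to one benchmark question
samples = ["yes", "no", "yes", "yes", "no"]
print(majority_vote(samples))  # -> yes
```

Voting over several sampled completions reduces the variance introduced by stochastic decoding, so a model's reported accuracy reflects its consistent answer rather than a single lucky or unlucky sample.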