🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models (LLMs) on data-flow reasoning—the procedural understanding of how data moves, transforms, and persists. To this end, we introduce FABLE, the first diagnostic benchmark for this capability. FABLE spans three domains—cooking, travel, and automation planning—and comprises 2,400 structured question-answer instances derived from eight canonical data-flow analyses drawn from software engineering, marking the first systematic transfer of program-analysis data-flow concepts to procedural natural-language text. We propose a scalable, cross-domain evaluation framework that integrates rule-driven modeling, multi-domain structured annotation, and a five-sample majority-voting protocol. Experiments on 8B-parameter models—DeepSeek-R1, LLaMA 3.1, and Granite Code—reveal that the specialized reasoning model achieves significantly higher accuracy but incurs more than 20× latency overhead, while the general-purpose and code-specialized models perform near chance, exposing fundamental deficiencies in state-evolution modeling and dependency tracking.
📝 Abstract
Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Although large language models (LLMs) are fluent in both natural and programming languages and are increasingly applied to decision-making in procedural tasks, they have not been systematically evaluated for their ability to perform data-flow reasoning. We introduce FABLE, an extensible benchmark designed to assess LLMs' understanding of data flow in structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark comprises 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but with inference more than 20 times slower than the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.
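The five-sample majority-voting protocol can be sketched as follows. This is a minimal illustration, not the paper's released code; the function name `majority_vote` and the assumption that model answers are already normalized to comparable strings are ours.

```python
from collections import Counter

def majority_vote(completions):
    """Return the most frequent answer among sampled completions.

    Assumes answers have already been normalized (e.g. lowercased,
    whitespace-stripped) so that equivalent answers compare equal.
    Ties are broken by first occurrence, per Counter.most_common.
    """
    counts = Counter(completions)
    answer, _ = counts.most_common(1)[0]
    return answer

# Example: five sampled answers to one benchmark question
samples = ["yes", "no", "yes", "yes", "no"]
print(majority_vote(samples))  # -> yes
```

Voting over several sampled completions reduces the variance introduced by stochastic decoding, so a model's reported accuracy reflects its consistent answer rather than a single lucky or unlucky sample.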