🤖 AI Summary
Large language models (LLMs) still face fundamental challenges in understanding and reasoning over complex spreadsheets—particularly those involving multiple interrelated sheets and deep hierarchical structures—due to inaccurate structural modeling and unreliable execution. To address this, we propose SheetBrain, a neuro-symbolic dual-workflow framework comprising three tightly integrated modules: structured understanding, programmatic execution, and automatic verification. A dynamic re-execution mechanism iteratively detects and corrects reasoning errors. We further construct SheetBench, a challenging multi-sheet benchmark designed to stress-test spreadsheet reasoning capabilities. The system integrates LLMs with a secure Python sandbox, preloaded table-processing libraries, and a custom Excel toolkit, enabling multi-step symbolic reasoning and verifiable program generation. Evaluations across multiple public spreadsheet benchmarks demonstrate substantial accuracy improvements, with an average gain of 18.7% on complex multi-sheet tasks, validating the framework's accuracy, robustness, and interpretability.
📝 Abstract
Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle to capture the complex structure of tables accurately and to ensure reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual-workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet, including a sheet summary and query-based problem insight, to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
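The understand–execute–validate loop with re-execution described above can be sketched in minimal form. Everything here is illustrative: the function names, the toy spreadsheet, and the hard-coded checks stand in for the paper's LLM-driven modules and sandboxed code generation, and are not SheetBrain's actual API.

```python
# Hypothetical sketch of a neuro-symbolic loop: understand -> execute -> validate,
# with re-execution on validation failure. In SheetBrain the execute step runs
# LLM-generated code in a Python sandbox; here it is simulated directly.

def understand(sheet, query):
    """Produce an overview: sheet summary plus query-focused insight (simulated)."""
    return {
        "summary": f"{len(sheet)} rows, columns: {sorted(sheet[0])}",
        "insight": f"Answer {query!r} by aggregating the relevant column.",
    }

def execute(sheet, query, context):
    """Stand-in for sandboxed program execution over the table."""
    # Toy task: total the 'sales' column.
    return sum(row["sales"] for row in sheet)

def validate(answer, sheet):
    """Stand-in for the validation module's correctness check."""
    return answer == sum(row["sales"] for row in sheet)

def sheetbrain_loop(sheet, query, max_retries=3):
    """Run the dual workflow, re-executing until validation passes."""
    context = understand(sheet, query)
    for _ in range(max_retries):
        answer = execute(sheet, query, context)
        if validate(answer, sheet):
            return answer
    raise RuntimeError("validation failed after retries")

sheet = [{"region": "EU", "sales": 10}, {"region": "US", "sales": 32}]
print(sheetbrain_loop(sheet, "total sales"))  # → 42
```

In the real system, a failed validation would feed error feedback back into the execution module before retrying, rather than simply re-running the same program.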