Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models perform poorly on complex, multi-step spreadsheet tasks in real-world settings. This work proposes the first reinforcement learning fine-tuning framework designed specifically for spreadsheet tasks, training specialized agents in Spreadsheet Gym, a realistic Microsoft Excel environment. The approach combines automated collection of paired start-goal spreadsheets, a refined tool set, and tool-routing rules tailored to spreadsheet operations, and it introduces Domain-Spreadsheet, a domain-specific benchmark. Experiments show that the method roughly doubles task completion for Qwen3-4B-Thinking-2507, raising Pass@1 from 12.0% to 23.4% on SpreadsheetBench and from 8.4% to 17.2% on Domain-Spreadsheet.

📝 Abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows promise on simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.
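
To make the Pass@1 metric and the role of paired start-goal spreadsheets concrete, here is a minimal sketch of how a single attempt per task might be scored against its goal spreadsheet. It is an illustration under stated assumptions, not the paper's evaluation code: the dictionary-based sheet representation and the names `cells_match`, `pass_at_1`, and `answer_cells` are hypothetical, and the actual checker used by SpreadsheetBench or Domain-Spreadsheet may compare spreadsheets differently.

```python
from typing import Any, Dict, List

# Assumption: a sheet is modeled as a mapping from cell address to value.
Sheet = Dict[str, Any]


def cells_match(output: Sheet, goal: Sheet, answer_cells: List[str]) -> bool:
    """Return True when every checked cell in `output` equals the goal value."""
    return all(output.get(cell) == goal.get(cell) for cell in answer_cells)


def pass_at_1(results: List[bool]) -> float:
    """Percentage of tasks solved when each task gets exactly one attempt."""
    if not results:
        raise ValueError("pass_at_1 needs at least one task result")
    return 100.0 * sum(results) / len(results)


# Hypothetical goal sheet and two single-attempt outputs from an agent.
goal = {"A1": 10, "A2": 20, "A3": 30}
attempts = [
    {"A1": 10, "A2": 20, "A3": 30},  # matches the goal in the checked cell
    {"A1": 10, "A2": 20, "A3": 31},  # wrong total in A3
]
results = [cells_match(out, goal, ["A3"]) for out in attempts]
print(pass_at_1(results))  # 50.0
```

Under this reading, the reported gain from 12.0% to 23.4% on SpreadsheetBench means that the share of tasks solved on the first attempt roughly doubled.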

Problem

Research questions and friction points this paper is trying to address.

spreadsheet agents

large language models

complex workflows

realistic tasks

automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning

Spreadsheet Agents

LLM Fine-tuning