Sheet as Token: A Graph-Enhanced Representation for Multi-Sheet Spreadsheet Understanding

📅 2026-05-07

🤖 AI Summary
This work addresses the challenges posed by information dispersion, heterogeneous schemas, and implicit relationships in multi-sheet spreadsheets. Conventional chunk-based representations, which split spreadsheets into rows, columns, or blocks, tend to fragment worksheets and weaken sheet-level semantics. To overcome this, the authors propose a worksheet-level tokenization approach that encodes each worksheet into a compact dense token. They further introduce a query-specific multi-relational graph over these sheet tokens to support cross-sheet retrieval, combining semantic similarity, query-conditioned relations, schema consistency, and shape compatibility. A multi-stage graph transformer composes these relation channels to retrieve the set of supporting sheets. Experiments on a newly constructed multi-sheet corpus indicate that the method learns stable representations and improves listwise retrieval over a shallow graph baseline with limited additional graph-side computation.
📝 Abstract
Workbook-scale spreadsheet understanding is increasingly important for language-model-based data analysis agents, but remains challenging because relevant information is often distributed across multiple sheets with heterogeneous schemas, layouts, and implicit relationships. Existing retrieval-augmented approaches typically decompose spreadsheets into rows, columns, or blocks to improve scalability; however, such chunk-centric representations can fragment worksheets into isolated text spans and weaken global sheet-level semantics. We propose Sheet as Token, a graph-enhanced framework that treats each worksheet as a unified semantic unit for multi-sheet spreadsheet retrieval. Our method extracts schema-aware records from sheet names, column headers, representative values, and layout features, and encodes each worksheet into a compact dense token. Given a natural-language query, a Graph Retriever constructs a query-specific candidate graph over sheet tokens using semantic, query-conditioned, schema-consistency, and shape-compatibility relations, and composes these channels through a multi-stage graph transformer to retrieve supporting sheet sets. Experiments on a constructed multi-sheet spreadsheet corpus show that sheet-level tokenization learns stable representations, and that graph-enhanced cross-sheet reasoning improves listwise retrieval over a shallow graph baseline with limited additional graph-side computation. These results suggest that sheet-level tokenization is a promising abstraction for scalable multi-sheet spreadsheet understanding.
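To make the pipeline in the abstract more concrete, here is a minimal sketch of how one worksheet might be summarized as a record, turned into a single vector, and compared with another sheet along the four relation channels. Everything in it is an assumption made for illustration: the `SheetRecord` fields, the hashed bag-of-words `sheet_token` (a stand-in for the learned encoder), and the scoring formulas in `relation_channels` are hypothetical and are not taken from the paper.

```python
import zlib
from dataclasses import dataclass
from math import sqrt


@dataclass
class SheetRecord:
    """Hypothetical schema-aware record for one worksheet."""
    name: str
    headers: list[str]
    sample_values: list[str]
    n_rows: int
    n_cols: int


def record_text(rec: SheetRecord) -> str:
    # Flatten sheet name, column headers, and representative values into one string.
    return " ".join([rec.name, *rec.headers, *rec.sample_values])


def sheet_token(text: str, dim: int = 64) -> list[float]:
    # Hashed bag-of-words vector, L2-normalized.
    # Assumption: this replaces the paper's learned dense encoder.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def ratio(a: int, b: int) -> float:
    # Smaller dimension divided by larger; 1.0 when both are zero.
    return min(a, b) / max(a, b) if max(a, b) > 0 else 1.0


def relation_channels(query: str, a: SheetRecord, b: SheetRecord) -> dict[str, float]:
    """Score one candidate edge between two sheets on four illustrative channels."""
    q = sheet_token(query)
    ta, tb = sheet_token(record_text(a)), sheet_token(record_text(b))
    ha = {h.lower() for h in a.headers}
    hb = {h.lower() for h in b.headers}
    union = ha | hb
    return {
        "semantic": cosine(ta, tb),                       # sheet-to-sheet similarity
        "query": min(cosine(q, ta), cosine(q, tb)),       # both sheets match the query
        "schema": len(ha & hb) / len(union) if union else 0.0,  # header Jaccard overlap
        "shape": ratio(a.n_rows, b.n_rows) * ratio(a.n_cols, b.n_cols),
    }


sales_2023 = SheetRecord("Sales 2023", ["Region", "Revenue"], ["EMEA", "1200"], 100, 2)
sales_2024 = SheetRecord("Sales 2024", ["region", "revenue"], ["APAC", "1500"], 50, 2)
edge = relation_channels("revenue by region", sales_2023, sales_2024)
```

In this sketch each channel yields one score per sheet pair, which could serve as edge features of a query-specific candidate graph. The paper composes its channels with a multi-stage graph transformer; that component is not reproduced here.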
Problem

Research questions and friction points this paper is trying to address.

spreadsheet understanding
multi-sheet
heterogeneous schemas
implicit relationships
global semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sheet as Token
graph-enhanced retrieval
multi-sheet spreadsheet understanding
schema-aware representation
graph transformer