Multimodal Tabular Reasoning with Privileged Structured Information

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses tabular reasoning when tables are available only as images, so the high-quality textual representations that LLM-based methods rely on are missing. It proposes Turbo, a framework that uses privileged structured information available during training to strengthen the table reasoning of multimodal large language models (MLLMs). Key contributions include: (1) a structure-aware reasoning trace generator, based on DeepSeek-R1, that produces high-quality modality-bridged training data; and (2) an iterative procedure that repeatedly generates and selects advantageous reasoning paths, transferring structured reasoning skills to the MLLM despite the gap between visual and structured inputs. With only 9K training samples, Turbo reports state-of-the-art performance across multiple datasets, a 7.2% improvement over the previous best.

📝 Abstract
Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation (Turbo), a new framework for multimodal tabular reasoning with privileged structured tables. Turbo benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, Turbo repeatedly generates and selects the advantageous reasoning paths, further enhancing the model's tabular reasoning ability. Experimental results demonstrate that, with limited (9k) data, Turbo achieves state-of-the-art performance (+7.2% vs. previous SOTA) across multiple datasets.
Problem

Research questions and friction points this paper is trying to address.

Enabling tabular reasoning from table images using privileged structured information
Aligning structured data with visual representations for accurate reasoning
Transferring structured reasoning skills to multimodal language models effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages privileged structured information for training
Uses structure-aware reasoning trace generator
Generates and selects advantageous reasoning paths
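
The "generate and select advantageous reasoning paths" step can be illustrated with a minimal sketch. The abstract does not say how a path's advantage is computed, so everything here is an assumption made for illustration: the `ReasoningPath` type, the exact-match reward in `score`, and the rule of keeping a path only when its reward exceeds the mean reward of the paths sampled for the same question. It is not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReasoningPath:
    """One sampled reasoning trace for a table-image question (hypothetical type)."""
    text: str    # step-by-step reasoning produced by the model
    answer: str  # final answer extracted from the trace


def score(path: ReasoningPath, gold_answer: str) -> float:
    """Assumed reward: 1.0 for a case-insensitive exact match, else 0.0."""
    same = path.answer.strip().lower() == gold_answer.strip().lower()
    return 1.0 if same else 0.0


def select_advantageous(paths, gold_answer):
    """Keep paths whose reward is strictly above the group's mean reward."""
    if not paths:
        return []
    rewards = [score(p, gold_answer) for p in paths]
    baseline = mean(rewards)  # group baseline (an assumption, not from the paper)
    return [p for p, r in zip(paths, rewards) if r - baseline > 0]


# Toy data: three paths sampled for one question whose gold answer is "100".
candidates = [
    ReasoningPath("Row 2 lists 40 and row 3 lists 60, so the total is 100.", "100"),
    ReasoningPath("Only row 2 is relevant, so the total is 40.", "40"),
    ReasoningPath("Summing the revenue column gives 100.", " 100 "),
]
kept = select_advantageous(candidates, gold_answer="100")
print([p.answer.strip() for p in kept])
```

Under this rule, a question on which every sampled path is correct (or every path is wrong) contributes nothing, because no path beats the group mean. In an iterative setup the kept paths would be used as fine-tuning data before sampling again; that loop is omitted here.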