🤖 AI Summary
Table image understanding faces two key challenges: difficulty in localizing question-relevant regions and low visual feature information density due to background redundancy. To address these, we propose a question-guided compact visual representation framework. Our method introduces three core innovations: (1) a progressive question injection mechanism that dynamically fuses question semantics into successive layers of a Vision Transformer (ViT); (2) a token-focused training strategy that jointly optimizes background token pruning and salient region enhancement; and (3) a lightweight visual encoder design. Evaluated on multiple table understanding benchmarks, the resulting model, TabFlash, consistently outperforms leading open-source and closed-source multimodal large language models (MLLMs). It achieves higher accuracy while reducing FLOPs by 27% and memory footprint by 30%, demonstrating a synergistic improvement in both efficiency and performance.
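To make the progressive injection idea concrete, here is a minimal sketch in numpy. The schedule function and the similarity-gated fusion below are hypothetical stand-ins (the paper does not specify them); they only illustrate the pattern of injecting the question into deeper ViT layers with increasing frequency.

```python
import numpy as np

def injection_layers(num_layers=12, freqs=(0.25, 0.5, 1.0)):
    """Hypothetical schedule: split the ViT into equal stages and inject
    the question into a growing fraction of each stage's layers,
    so deeper stages receive the question more frequently."""
    stage_len = num_layers // len(freqs)
    layers = []
    for s, f in enumerate(freqs):
        period = round(1.0 / f)          # inject every `period`-th layer
        for i in range(stage_len):
            if i % period == 0:
                layers.append(s * stage_len + i)
    return layers

def inject_question(visual_tokens, question_vec):
    """Toy fusion step (stand-in for cross-attention): bias each visual
    token toward the question embedding, gated by their similarity."""
    gate = 1.0 / (1.0 + np.exp(-(visual_tokens @ question_vec)))  # (N,)
    return visual_tokens + gate[:, None] * question_vec
```

With the defaults, a 12-layer ViT injects the question at layers `[0, 4, 6, 8, 9, 10, 11]`: once in the first stage, twice in the second, and at every layer of the last stage.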
📝 Abstract
Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory compared to the second-best MLLM.
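The pruning and token-focusing ideas above can be sketched as follows. The question-similarity saliency score and the softmax-mass objective are illustrative assumptions, not the paper's actual formulation; they only show how background tokens might be dropped and how a loss can push saliency mass onto the retained tokens.

```python
import numpy as np

def prune_background(tokens, question_vec, keep_ratio=0.5):
    """Keep the visual tokens most similar to the question embedding
    (assumed saliency score); treat the rest as background and drop them."""
    scores = tokens @ question_vec                       # (N,)
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])              # indices retained
    return tokens[keep], keep

def focus_loss(scores, keep_idx):
    """Toy token-focusing objective: negative log of the softmax
    saliency mass on retained tokens, so minimizing it concentrates
    information in the tokens that survive pruning."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[keep_idx].sum())
```

Minimizing `focus_loss` during training would counteract the information loss from pruning by rewarding the model for routing saliency into the retained tokens rather than the discarded background.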