Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of controlled evaluation of how table representations (text vs. image) pair with model types (LLMs vs. multimodal LLMs) in table question answering (TQA). We conduct the first controlled study of these combinations along two dimensions, question complexity and table size, using a new benchmark built from existing TQA datasets. A systematic analysis of seven pairs of MLLMs and LLMs shows that the best combination of representation and model varies across setups, with no universally optimal pairing. To avoid the limits of a fixed representation choice, we propose FRES (Flexible Representation Selection), a method that selects the input representation (text or image) dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.

📝 Abstract

In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.
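The abstract does not spell out how FRES chooses a representation, so the following is only a minimal sketch of what dynamic selection along the two studied dimensions (table size and question complexity) could look like. Every name, threshold, cue word, and policy entry here is a hypothetical placeholder, not the authors' method; in practice the policy table would be filled in from validation results.

```python
from dataclasses import dataclass


@dataclass
class TableQuestion:
    question: str
    num_rows: int
    num_cols: int


# Hypothetical values, chosen only for illustration (not from the paper).
SIZE_THRESHOLD_CELLS = 200
REASONING_CUES = ("average", "sum", "total", "difference", "how many")

# Hypothetical policy: (size bucket, complexity bucket) -> representation.
# The entries are placeholders; they do not reproduce the paper's findings.
POLICY = {
    ("small", "simple"): "image",
    ("small", "complex"): "text",
    ("large", "simple"): "text",
    ("large", "complex"): "text",
}


def size_bucket(item: TableQuestion) -> str:
    """Bucket a table by its number of cells."""
    cells = item.num_rows * item.num_cols
    return "large" if cells > SIZE_THRESHOLD_CELLS else "small"


def complexity_bucket(item: TableQuestion) -> str:
    """Crude keyword heuristic standing in for a real complexity estimate."""
    q = item.question.lower()
    return "complex" if any(cue in q for cue in REASONING_CUES) else "simple"


def select_representation(item: TableQuestion) -> str:
    """Return 'text' or 'image' for this table-question pair."""
    return POLICY[(size_bucket(item), complexity_bucket(item))]
```

The selected label would then decide whether the table is serialized as text or rendered as an image before being passed to the model.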

Problem

Research questions and friction points this paper is trying to address.

Compare text vs image table representations in TQA

Evaluate model effectiveness by question complexity and table size

Propose dynamic representation selection to improve TQA performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled study on table representations and models

Dynamic selection method FRES improves performance

Benchmark based on existing TQA datasets
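As a rough illustration of the kind of controlled comparison described here, the sketch below groups evaluation records by setup and reports the best-scoring representation and model pairing for each. The record layout, the setup keys, and the pairing labels are assumptions made for this example; they are not the benchmark's actual format.

```python
from collections import defaultdict


def best_pairing_per_setup(records):
    """Find the most accurate pairing within each setup.

    records: iterable of (setup, pairing, correct) tuples, where `setup`
    is any hashable key such as (table size bucket, complexity bucket),
    `pairing` names a representation and model combination, and `correct`
    is a bool. This layout is hypothetical.
    """
    # (setup, pairing) -> [number correct, number seen]
    totals = defaultdict(lambda: [0, 0])
    for setup, pairing, correct in records:
        totals[(setup, pairing)][0] += int(correct)
        totals[(setup, pairing)][1] += 1

    best = {}
    for (setup, pairing), (n_correct, n_total) in totals.items():
        accuracy = n_correct / n_total
        # Keep the first pairing seen on ties.
        if setup not in best or accuracy > best[setup][1]:
            best[setup] = (pairing, accuracy)
    return best
```

If the returned pairings differ from one setup to the next, a single fixed representation is leaving accuracy on the table, which is the situation a dynamic selector is meant to address.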
