🤖 AI Summary
Existing multimodal table understanding methods lack explicit supervision for multi-step reasoning, so models often produce short, inaccurate answers with opaque inference processes. To address this limitation, this work proposes CoReTab, a code-driven reasoning framework. By pairing multi-step reasoning with executable Python code, CoReTab produces scalable, interpretable, and automatically verifiable annotations, which the authors use to curate a dataset of 115K verified samples. They fine-tune open-source multimodal large language models (MLLMs) with a three-stage pipeline and evaluate on 17 MMTab benchmarks, reporting gains over MMTab-trained baselines of 6.2% on table question answering, 5.7% on fact verification, and 25.6% on table structure understanding. The resulting model also produces transparent and verifiable reasoning traces, which improves interpretability and trustworthiness.
📝 Abstract
Existing datasets for multimodal table understanding, such as MMTab, primarily provide short factual answers without explicit multi-step reasoning supervision. Models trained on these datasets often generate brief responses that offer insufficient accuracy and limited insight into how the models arrive at the final answer. We introduce CoReTab, a code-driven reasoning framework that produces scalable, interpretable, and automatically verifiable annotations by coupling multi-step reasoning with executable Python code. Using the CoReTab framework, we curate a dataset of 115K verified samples averaging 529 tokens per response and fine-tune open-source MLLMs through a three-stage pipeline. We evaluate the resulting model across 17 MMTab benchmarks spanning table question answering, fact verification, and table structure understanding. Our model achieves significant gains of +6.2%, +5.7%, and +25.6%, respectively, over MMTab-trained baselines, while producing transparent and verifiable reasoning traces. These results establish CoReTab as a robust and generalizable supervision framework for improving multi-step reasoning in multimodal table understanding.
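To make the idea of "automatically verifiable" annotations concrete, here is a minimal sketch of how a code-backed annotation could be checked: the candidate Python code is executed against the table, and the sample is kept only if the result matches the known answer. This is an illustration under stated assumptions, not the authors' pipeline. The names `verify_sample` and `normalize`, the convention that the code stores its result in a variable called `answer`, and the toy table are all hypothetical.

```python
from typing import Any, Dict, List


def normalize(value: Any) -> str:
    """Canonicalize an answer for comparison (assumed rule: trim, lowercase, 3.0 -> 3)."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().lower()


def verify_sample(code: str, table: List[Dict[str, Any]], gold_answer: Any) -> bool:
    """Run candidate reasoning code; accept it only if it reproduces the gold answer."""
    namespace: Dict[str, Any] = {"table": table}
    try:
        # Assumption: the code is trusted or sandboxed elsewhere; exec alone is not a sandbox.
        exec(code, namespace)
    except Exception:
        return False  # code that crashes cannot be verified
    if "answer" not in namespace:
        return False  # hypothetical convention: the result must be stored in `answer`
    return normalize(namespace["answer"]) == normalize(gold_answer)


# Toy table and two candidate annotations for "Which city has the largest population?"
table = [
    {"city": "Oslo", "population": 700000},
    {"city": "Bergen", "population": 285000},
]
good_code = "answer = max(table, key=lambda row: row['population'])['city']"
bad_code = "answer = table[1]['city']"

print(verify_sample(good_code, table, "Oslo"))  # True
print(verify_sample(bad_code, table, "Oslo"))   # False
```

A filter of this kind keeps only annotations whose code reproduces the reference answer, which is one plausible reading of how execution can serve as an automatic check on multi-step reasoning.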