🤖 AI Summary
This work addresses the lack of a unified evaluation benchmark for vision-tabular multimodal learning, particularly in high-stakes domains such as healthcare. To bridge this gap, we introduce VT-Bench, the first cross-domain vision-tabular benchmark, encompassing 9 domains, 14 datasets, and more than 756,000 samples, and supporting both discriminative prediction and generative reasoning tasks. We systematically evaluate 23 representative models, including unimodal baselines, specialized multimodal architectures, general-purpose vision-language models, and tool-augmented approaches. VT-Bench establishes a standardized platform for rigorous assessment, reveals critical limitations of current methods, and provides strong baselines to facilitate the development of domain-specific foundation models in vision-tabular learning.
📝 Abstract
Multimodal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (predominantly medical, but also covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, and the results highlight substantial challenges in visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multimodal vision-tabular foundation models.
Benchmark: https://github.com/Ziyi-Jia990/VT-Bench