VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

📅 2026-05-03

🤖 AI Summary
This work addresses the lack of a unified evaluation benchmark for visual-tabular multimodal learning, particularly in high-stakes domains such as healthcare. To bridge this gap, the authors introduce VT-Bench, the first cross-domain visual-tabular benchmark, spanning 9 domains, 14 datasets, and more than 756,000 samples, and supporting both discriminative prediction and generative reasoning tasks. They systematically evaluate 23 representative models, including unimodal baselines, specialized multimodal architectures, general-purpose vision-language models, and tool-augmented approaches. VT-Bench establishes a standardized platform for rigorous assessment, reveals critical limitations of current methods, and provides strong baselines to facilitate the development of domain-specific foundation models for visual-tabular learning.
📝 Abstract
Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains such as healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing visual-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while also covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, and the results highlight substantial challenges in visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal visual-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench
Problem

Research questions and friction points this paper is trying to address.

visual-tabular learning
multi-modal learning
benchmark
discriminative prediction
generative reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-tabular learning
unified benchmark
multi-modal foundation models
discriminative and generative tasks
vision-language models
Zi-Yi Jia
School of Intelligence Science and Technology, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Zi-Jian Cheng
School of Intelligence Science and Technology, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Xin-Yue Zhang
School of Intelligence Science and Technology, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China
Kun-Yang Yu
LAMDA Group, Nanjing University
Machine Learning
Zhi Zhou
Principal Computational Scientist, Argonne National Laboratory
Simulation, Optimization, Statistics, Markets, Energy
Yu-Feng Li
Professor, Nanjing University
Machine Learning
Lan-Zhe Guo
LAMDA Group, Nanjing University
Machine Learning