HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address modality bias in multimodal large language models (MLLMs) for table understanding, this paper proposes HIPPO, a hybrid-modal preference optimization framework. HIPPO represents each table in both text and image form, and introduces a modality-consistent sampling strategy within Direct Preference Optimization (DPO) training to increase response diversity and mitigate modality bias. Evaluated on table question answering and table fact verification, it achieves a 4% improvement over various table reasoning models. Further analysis indicates that HIPPO improves reasoning over unimodal table representations and helps extract crucial, distinct semantics from each modality. All data and code are publicly released.

📝 Abstract
Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and images, and optimizes multimodal large language models (MLLMs) to learn more comprehensive table information from these modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during DPO training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, which achieves a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and code are available at https://github.com/NEUIR/HIPPO.
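For readers unfamiliar with the DPO objective that the abstract refers to, the following is a minimal sketch of the standard per-example DPO loss, not HIPPO's released training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. The inputs are summed log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where *_w is the chosen response and *_l the rejected response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) == softplus(-x); computed in a numerically stable way.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), the value at zero margin.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss equals log 2 when the policy and the reference agree on both responses, and it decreases as the policy raises the chosen response relative to the rejected one.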
Problem

Research questions and friction points this paper is trying to address.

Enhancing table understanding in Large Language Models
Optimizing hybrid-modal table representation learning
Improving table question answering and fact verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-modal table representations
Modality-consistent sampling strategy
Enhanced table reasoning capabilities
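As a rough illustration of how responses sampled from different table representations could be turned into DPO preference pairs, here is a hypothetical sketch. It assumes three input modes (`"text"`, `"image"`, `"hybrid"`) and treats a response as preferred when it exactly matches a gold answer. The `Sample` class and `build_preference_pairs` helper are invented for this example, and the sketch does not attempt to reproduce the paper's modality-consistent sampling strategy, whose details are not given on this page.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Sample:
    modality: str  # assumed labels: "text", "image", or "hybrid"
    response: str  # model answer generated from that table representation


def _normalize(answer: str) -> str:
    return answer.strip().lower()


def build_preference_pairs(samples, gold_answer):
    """Pair every correct response (chosen) with every incorrect one (rejected).

    Correctness here is a simple exact match after normalization, which is an
    assumption made for illustration only.
    """
    gold = _normalize(gold_answer)
    chosen = [s for s in samples if _normalize(s.response) == gold]
    rejected = [s for s in samples if _normalize(s.response) != gold]
    return list(product(chosen, rejected))


if __name__ == "__main__":
    samples = [
        Sample("text", "42"),
        Sample("image", "41"),
        Sample("hybrid", "42 "),
    ]
    for chosen, rejected in build_preference_pairs(samples, "42"):
        print(chosen.modality, ">", rejected.modality)
```

Pooling responses across modalities before pairing means a correct answer produced from one representation can serve as the preferred response against an incorrect answer produced from another, which is one plausible way to expose the model to complementary signals.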
Authors

Zhenghao Liu
Northeastern University
NLP, Information Retrieval

Haolan Wang
Department of Computer Science and Technology, Northeastern University, China

Xinze Li
Department of Computer Science and Technology, Northeastern University, China

Qiushi Xiong
Northeastern University
Natural Language Processing, Information Retrieval

Xiaocui Yang
Lecturer, Northeastern University (China)
Multimodal Sentiment Analysis, Data Mining, Multimodal Large Language Models

Yu Gu
Department of Computer Science and Technology, Northeastern University, China

Yukun Yan
Tsinghua University
Large Language Model

Qi Shi
Department of Computer Science and Technology, Institute for AI, Tsinghua University, China

Fangfang Li
Lendlease
Data Mining, Text Mining, Advertising

Ge Yu
Department of Computer Science and Technology, Northeastern University, China

Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing