TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Table recognition (TR) has long relied on supervised learning, and pushing performance further demands large-scale labeled data that is costly to obtain. Open-source models, which are often trained with limited resources and are in practice the only viable option for many users under privacy regulations, therefore lag far behind proprietary ones. Method: TRivia is a self-supervised fine-tuning method, built on Group Relative Policy Optimization (GRPO), that lets a pretrained vision-language model learn TR directly from unlabeled table images. An attention-guided module generates diverse questions for each image, and a question-answering-based reward scores how well those questions can be answered from the model's recognition output, forming a closed loop that needs no human annotations. The method also automatically identifies the unlabeled samples that most effectively facilitate learning. Contribution/Results: TRivia-3B, a compact open-source TR model trained with this pipeline, surpasses existing systems, including Gemini 2.5 Pro and MinerU2.5, on three popular benchmarks.

📝 Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia
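The reward loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `qa_reward` and `group_relative_advantages` are hypothetical, the exact-match scoring rule is an assumption, and the reference answers are assumed to come from the question generator that reads the table image. The only part taken from the general GRPO recipe is the normalization of each candidate's reward against the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def qa_reward(predicted_answers, reference_answers):
    """Fraction of questions answered correctly from one recognition output.

    Assumption: answers are compared by case-insensitive exact match.
    """
    if not reference_answers:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted_answers, reference_answers)
    )
    return hits / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled outputs (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate recognitions of the same table image, each scored by QA accuracy.
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, candidates whose recognized table supports more correct answers than the group average receive a positive advantage, and a group in which every candidate scores the same yields all-zero advantages and therefore no learning signal.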

Problem

Research questions and friction points this paper is trying to address.

Self-supervised fine-tuning for table recognition without labeled data

Bridging the performance gap between proprietary and open-source table recognition models

Enabling vision-language models to learn from unlabeled table images autonomously

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised fine-tuning of vision-language models for table recognition, built on Group Relative Policy Optimization

Uses a question-answering-based reward mechanism that removes the need for human annotations

Generates diverse questions for each table image via an attention-guided module
