POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

📅 2025-09-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In document conversion, high-quality annotated data for complex layouts such as tables, mathematical formulas, and multi-column text is scarce and expensive to obtain manually. Automatic labeling with existing models is often inaccurate on these cases, so distilling from a teacher model can limit the student's real-world performance. To address this, the paper proposes a fully automated, distillation-free, two-stage framework: (1) large-scale, diverse synthetic data generation, used to train a model that extracts key elements in a unified format; and (2) iterative self-improvement, in which the model annotates real documents, a suite of filtering strategies verifies annotation quality, and the model is retrained on the verified data. POINTS-Reader, obtained by training the public POINTS-1.5 model with this framework, surpasses many existing public and proprietary models of comparable or larger size. The model is available at https://github.com/Tencent/POINTS-Reader.

📝 Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
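The second-stage loop described in the abstract (annotate real documents, filter the annotations, retrain, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `self_improve`, the type aliases, and the idea of passing the model, filters, and trainer as plain callables are all assumptions made here for clarity, and the abstract does not specify which filtering strategies are used.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real system would use document images and a
# vision-language model rather than plain strings and functions.
Document = str
Annotation = str
Model = Callable[[Document], Annotation]
QualityFilter = Callable[[Document, Annotation], bool]
Trainer = Callable[[List[Tuple[Document, Annotation]]], Model]


def self_improve(
    model: Model,
    documents: Sequence[Document],
    filters: Sequence[QualityFilter],
    train: Trainer,
    rounds: int,
) -> Tuple[Model, List[int]]:
    """Run `rounds` iterations of annotate -> filter -> retrain.

    Returns the final model and the number of verified samples per round.
    """
    verified_counts: List[int] = []
    for _ in range(rounds):
        # 1. Annotate real documents with the current model.
        annotated = [(doc, model(doc)) for doc in documents]
        # 2. Keep only annotations that pass every quality filter.
        verified = [
            (doc, ann)
            for doc, ann in annotated
            if all(check(doc, ann) for check in filters)
        ]
        verified_counts.append(len(verified))
        # 3. Retrain on the verified set (skipped if nothing passed).
        if verified:
            model = train(verified)
    return model, verified_counts
```

In this sketch, the model used in round one would be the model trained on synthetic data in the first stage. If each retraining step improves the model, more annotations pass the filters in later rounds, which is the mechanism by which data quality and model quality improve together.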

Problem

Research questions and friction points this paper is trying to address.

Automated document conversion lacks accurate labeled data for complex formats

Manual annotation is costly and time-consuming, while automatic labeling with existing models is inaccurate on complex layouts

Distillation-based training limits real-world document extraction performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distillation-free framework for document conversion

Synthetic data generation for initial model training

Self-improvement approach with iterative filtering and retraining

🔎 Similar Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding