POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
In document conversion, high-quality annotated data for complex layouts such as tables, mathematical formulas, and multi-column text is scarce and expensive to produce manually. Automatic labeling with existing models is inaccurate on exactly these cases, so distilling a student from a teacher model can significantly limit the student's real-world performance. To address this, the paper proposes a fully automated, distillation-free framework with two stages: (1) large-scale, diverse synthetic data generation, which gives the model strong initial performance and a unified output format; and (2) iterative self-improvement, in which the model annotates real documents, a suite of filtering strategies verifies the annotations, and the model is retrained on the verified data. POINTS-Reader, obtained by training the public POINTS-1.5 model with this procedure, surpasses many existing public and proprietary models of comparable or larger size. The model is available at https://github.com/Tencent/POINTS-Reader.

📝 Abstract
High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
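
The annotate, filter, retrain cycle described in the abstract can be sketched as a short loop. This is only an illustrative sketch, not the authors' implementation: the names `self_improve`, `make_model`, `non_empty`, and `toy_retrain` are hypothetical, documents and annotations are stood in for by plain strings, and the single toy filter merely rejects empty outputs, since the abstract does not specify the actual filtering strategies.

```python
from typing import Callable, Iterable, List, Tuple

Document = str      # stand-in for a real page image or file (assumption)
Annotation = str    # stand-in for the model's converted text (assumption)
Sample = Tuple[Document, Annotation]
Model = Callable[[Document], Annotation]
Filter = Callable[[Document, Annotation], bool]


def self_improve(
    model: Model,
    documents: Iterable[Document],
    filters: List[Filter],
    retrain: Callable[[List[Sample]], Model],
    rounds: int,
) -> Tuple[Model, List[Sample]]:
    """Repeat annotate -> filter -> retrain for a fixed number of rounds."""
    docs = list(documents)
    verified: List[Sample] = []
    for _ in range(rounds):
        # 1. Annotate real documents with the current model.
        annotated = [(doc, model(doc)) for doc in docs]
        # 2. Keep only annotations that pass every quality filter.
        verified = [
            (doc, ann) for doc, ann in annotated
            if all(check(doc, ann) for check in filters)
        ]
        # 3. Retrain on the verified subset to get the next model.
        model = retrain(verified)
    return model, verified


# --- Toy stand-ins so the loop can be run end to end (all hypothetical) ---

def make_model(max_len: int) -> Model:
    # A "model" that only handles documents up to max_len characters
    # and returns an empty annotation otherwise.
    return lambda doc: doc.upper() if len(doc) <= max_len else ""


def non_empty(doc: Document, ann: Annotation) -> bool:
    # Toy filter: reject empty annotations.
    return bool(ann)


def toy_retrain(samples: List[Sample]) -> Model:
    # Toy "training": the new model copes with slightly longer documents
    # than the longest one it was trained on.
    longest = max((len(doc) for doc, _ in samples), default=0)
    return make_model(longest + 2)


docs = ["a", "abc", "abcde", "abcdefg"]
model, data = self_improve(make_model(1), docs, [non_empty], toy_retrain, rounds=3)
print(len(data), "verified samples after 3 rounds")
```

In this toy run the verified set grows from one round to the next, which mirrors the claim that iterating improves both the model and the quality of the generated data. A real pipeline would replace the stand-ins with an actual vision-language model, real filters, and a fine-tuning step.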

Problem

Research questions and friction points this paper is trying to address.

- Automated document conversion lacks accurate labeled data for complex formats such as tables, formulas, and multi-column text
- Manual annotation is costly and time-consuming, while automatic labels from existing models are inaccurate
- Distillation-based training limits real-world document extraction performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

- Distillation-free framework for document conversion
- Synthetic data generation for initial model training
- Self-improvement approach with iterative filtering and retraining

Yuan Liu
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Zhongyin Zhao
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Le Tian
University of Antwerpen - imec
Internet of things, sensor networks, IEEE 802.11ah

Haicheng Wang
Pattern Recognition Center, WeChat AI, Tencent Inc, China; Shanghai Jiao Tong University

Xubing Ye
Tsinghua University
Vision Language Model, Large Language Model

Yangxiu You
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Zilin Yu
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Chuhan Wu
WeChat AI, Tencent
Foundation Model, Pretraining, Post Training, LLM Agent

Xiao Zhou
M.Phil student in HKUST
Autonomous Driving, DRL

Yang Yu
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China