DocAtlas: Multilingual Document Understanding Across 80+ Languages

📅 2026-05-12

🤖 AI Summary
This study addresses the limited performance of multilingual document understanding on low-resource languages, which stems from scarce training data and from model-based annotation pipelines that carry existing biases forward. The authors propose a dual-rendering pipeline that needs no learned models for core annotation: differential rendering of native DOCX documents, plus LaTeX-based synthesis for right-to-left scripts. The pipeline yields a high-fidelity OCR dataset and benchmark spanning 82 languages and 9 tasks, annotated in a unified DocTag structured format. They also apply Direct Preference Optimization (DPO) to multilingual document understanding for the first time, avoiding the out-of-domain degradation that supervised fine-tuning commonly causes. Experiments show improvements of 1.9% in in-domain accuracy and 1.8% in out-of-domain accuracy, with no measurable loss on base languages. The resulting DocAtlas-DeepSeek model surpasses the strongest baseline by 1.7%.
📝 Abstract
Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO), using rendering-derived ground truth as the positive signal, achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
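
To make the DPO step concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It assumes the "chosen" sequence is the rendering-derived ground truth and the "rejected" sequence is some less accurate transcription; how the paper actually builds rejected samples is not stated in the abstract. The function name, argument names, and the `beta` default are hypothetical and not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is a summed sequence log-probability. "chosen" is assumed
    to be the rendering-derived ground truth; "rejected" is assumed to be a
    less accurate output. beta=0.1 is an assumed default, not the paper's.
    """
    # How much more the policy likes each sequence than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When policy and reference agree, the loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# When the policy favors the chosen sequence more than the reference does,
# the loss falls below that baseline.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

The loss depends only on log-probability ratios against a frozen reference model, so the policy is rewarded for ranking the ground truth above the alternative and not for drifting away from the reference overall. That property is one plausible reading of why the abstract reports no measurable base-language degradation.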
Problem

Research questions and friction points this paper is trying to address.

multilingual document understanding
low-resource languages
training data scarcity
annotation bias
OCR datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

DocAtlas
multilingual document understanding
differential rendering
synthetic LaTeX generation
Direct Preference Optimization