DocAtlas: Multilingual Document Understanding Across 80+ Languages

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the weak performance of multilingual document understanding on low-resource languages, which stems from scarce training data and from model-based annotation pipelines that perpetuate existing biases. To overcome these challenges, the authors propose a learning-free dual-rendering pipeline, combining differential rendering of native DOCX documents with LaTeX-based synthesis, to generate a high-fidelity OCR dataset spanning 82 languages and 9 tasks, annotated in a unified DocTag structured format. They also introduce Direct Preference Optimization (DPO) to multilingual document understanding for the first time, avoiding the out-of-domain performance degradation commonly induced by supervised fine-tuning. Experimental results show improvements of 1.9% in in-domain accuracy and 1.8% in out-of-domain accuracy, without compromising zero-shot performance on base languages. The resulting DocAtlas-DeepSeek model surpasses the strongest baseline by 1.7%.

📝 Abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
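
To make the DPO step in the abstract more concrete, here is a minimal sketch of the standard per-example DPO loss. It is not the authors' training code. It assumes the "chosen" response is the rendering-derived ground truth and the "rejected" response is some less accurate transcription of the same page; the abstract does not say how rejected samples are obtained. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are all hypothetical choices made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss (illustrative sketch, not the paper's code).

    Each argument is a summed sequence log-probability:
    - *_chosen_logp: log-prob of the preferred output (assumed here to be
      the rendering-derived ground truth)
    - *_rejected_logp: log-prob of the dispreferred output (assumption:
      a less accurate transcription of the same document)
    - policy_*: from the model being trained; ref_*: from a frozen reference
    - beta: strength of the implicit constraint toward the reference model
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen output relative to the reference. Because the objective is anchored to a frozen reference model, it penalizes only the relative preference, which is one plausible reading of why the abstract reports less out-of-domain degradation than with supervised fine-tuning.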

Problem

Research questions and friction points this paper is trying to address.

multilingual document understanding

low-resource languages

training data scarcity

annotation bias

OCR datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

DocAtlas

multilingual document understanding

differential rendering