ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of efficiently and accurately converting web HTML into structured formats (Markdown/JSON), this paper introduces ReaderLM-v2, a compact 1.5B-parameter language model supporting 512K-token context windows. Methodologically, the authors propose the first "draft–refine–critique" three-stage synthetic data pipeline and a unified training framework that integrates continuous pre-training with multi-objective supervised fine-tuning, jointly optimizing for long-context modeling and multi-format (HTML→Markdown/JSON) supervision. On benchmarks featuring documents exceeding 100K tokens, ReaderLM-v2 outperforms GPT-4o-2024-08-06 by 15–20% in accuracy while substantially reducing inference cost. Key contributions include: (1) the first lightweight, domain-specialized model for ultra-long HTML parsing; (2) a scalable, high-fidelity synthetic data paradigm; and (3) an end-to-end structured extraction solution that jointly optimizes precision and computational efficiency.

📝 Abstract
We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
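In practice, instruction-tuned extraction models like this are typically driven with a chat-style prompt that pairs an instruction with the raw HTML. The sketch below shows one plausible prompt builder; the instruction wording, message format, and default output format are assumptions for illustration, not taken from the paper, so consult the model card for the exact template.

```python
# Hypothetical chat-prompt builder for an HTML-to-Markdown/JSON extraction
# model. The instruction text below is an assumption, not the paper's
# official prompt template.
def build_extraction_prompt(html: str, output_format: str = "markdown") -> list[dict]:
    """Return a chat-style message list asking a model to convert HTML."""
    instruction = (
        f"Extract the main content from the given HTML and convert it to "
        f"{output_format} format."
    )
    # Single user turn: instruction first, then the raw HTML payload.
    return [{"role": "user", "content": f"{instruction}\n\n{html}"}]

messages = build_extraction_prompt("<h1>Title</h1><p>Body text.</p>")
```

The resulting `messages` list can be passed to any chat-completion API or to a tokenizer's chat template; for JSON output one would call `build_extraction_prompt(html, output_format="json")`.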
Problem

Research questions and friction points this paper is trying to address.

Efficient web content extraction from HTML
Transforming HTML to clean Markdown or JSON
Optimizing computational efficiency for large documents
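One common way to stretch a fixed token budget over large documents, as the last point above concerns, is to pre-clean the raw HTML before it ever reaches the model. The sketch below drops `<script>`/`<style>` blocks and HTML comments with regular expressions; this is a generic pre-processing assumption, not the specific cleaning pipeline used for ReaderLM-v2.

```python
import re

# Minimal HTML pre-cleaning pass: strip script/style blocks and comments
# so the long-context token budget is spent on actual content.
# (Assumed pre-processing; the paper's exact cleaning rules may differ.)
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def clean_html(html: str) -> str:
    """Remove non-content markup that inflates token counts."""
    html = SCRIPT_RE.sub("", html)
    html = STYLE_RE.sub("", html)
    html = COMMENT_RE.sub("", html)
    return html

cleaned = clean_html("<style>p{}</style><p>Hello</p><!-- ad slot -->")
```

Regex-based stripping is a pragmatic choice for speed on multi-hundred-kilobyte pages; a full HTML parser would be more robust for malformed markup.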
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage data synthesis pipeline
Unified training framework
Efficient HTML to Markdown/JSON conversion
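The first innovation above, the draft–refine–critique synthesis pipeline, can be sketched as a simple control loop. The callables below are hypothetical stand-ins for LLM calls, and the score threshold and round limit are illustrative assumptions; the paper's actual prompts and acceptance criteria are not reproduced here.

```python
from typing import Callable

# Schematic of a "draft–refine–critique" data-synthesis loop. Each callable
# is a placeholder for an LLM call (assumed interface, not the paper's API):
#   draft(html)            -> initial extraction
#   refine(html, cand)     -> improved extraction
#   critique(html, cand)   -> quality score in [0, 1]
def synthesize(html: str,
               draft: Callable[[str], str],
               refine: Callable[[str, str], str],
               critique: Callable[[str, str], float],
               threshold: float = 0.9,
               max_rounds: int = 3) -> str:
    """Draft an extraction, then refine until the critic's score passes."""
    candidate = draft(html)
    for _ in range(max_rounds):
        if critique(html, candidate) >= threshold:
            break  # critic accepts: keep this training example
        candidate = refine(html, candidate)
    return candidate

# Deterministic stubs to illustrate the control flow only.
out = synthesize(
    "<p>hi</p>",
    draft=lambda h: "draft",
    refine=lambda h, c: c + "+refined",
    critique=lambda h, c: 1.0 if c.endswith("refined") else 0.0,
)
```

The loop terminates either when the critic accepts a candidate or after a fixed number of refinement rounds, which bounds the synthesis cost per example.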
👥 Authors
Feng Wang — Jina AI GmbH
Zesheng Shi — Harbin Institute of Technology
Bo Wang — Jina AI GmbH
Nan Wang — Jina AI GmbH
Han Xiao — Jina AI GmbH