🤖 AI Summary
To address the challenge of efficiently and accurately converting web HTML into structured formats (Markdown/JSON), this paper introduces ReaderLM-v2: a compact 1.5B-parameter language model supporting 512K-token context windows. Methodologically, we propose the first three-stage "draft–refine–critique" synthetic data pipeline, together with a unified training framework that integrates continuous pretraining with multi-objective supervised fine-tuning, jointly optimizing for long-context modeling and multi-format (HTML→Markdown/JSON) supervision. On benchmarks of documents exceeding 100K tokens, ReaderLM-v2 outperforms GPT-4o-2024-08-06 by 15–20% in accuracy while substantially reducing inference cost. Key contributions include: (1) the first lightweight, domain-specialized model for ultra-long HTML parsing; (2) a scalable, high-fidelity synthetic data paradigm; and (3) an end-to-end structured extraction solution that jointly optimizes precision and computational efficiency.
📝 Abstract
We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. Our model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extractions; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15–20% on carefully curated benchmarks, particularly excelling on documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
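When feeding "messy HTML" to a compact extraction model, a common practical step is to strip scripts, styles, and comments first so the long-context token budget is spent on actual content. The helper below is an illustrative preprocessing sketch, not the preprocessing described in the paper.

```python
import re

# Remove <script>/<style> blocks and HTML comments before tokenization.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>",
                             re.DOTALL | re.IGNORECASE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def clean_html(html: str) -> str:
    """Drop non-content markup and collapse excess blank lines."""
    html = SCRIPT_STYLE_RE.sub("", html)
    html = COMMENT_RE.sub("", html)
    return re.sub(r"\n{3,}", "\n\n", html).strip()
```

For example, `clean_html("<style>a{}</style><p>Hi</p><!-- ad -->")` keeps only `<p>Hi</p>`. A regex-based cleaner like this is a rough heuristic; a production pipeline would more likely use a real HTML parser.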