🤖 AI Summary
To address the challenge of efficiently and accurately converting web HTML into structured formats (Markdown/JSON), this paper introduces ReaderLM-v2: a compact 1.5B-parameter language model supporting 512K-token context windows. Methodologically, we propose the first three-stage "draft–refine–critique" synthetic data pipeline, together with a unified training framework that integrates continuous pretraining with multi-objective supervised fine-tuning, jointly optimizing for long-context modeling and multi-format (HTML→Markdown/JSON) supervision. On benchmarks of documents exceeding 100K tokens, ReaderLM-v2 outperforms GPT-4o-2024-08-06 by 15–20% in accuracy while substantially reducing inference cost. Key contributions include: (1) the first lightweight, domain-specialized model for ultra-long HTML parsing; (2) a scalable, high-fidelity synthetic data paradigm; and (3) an end-to-end structured extraction solution that jointly optimizes precision and computational efficiency.
📝 Abstract
We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. Our model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extractions; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15–20% on carefully curated benchmarks, particularly excelling on documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
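When feeding "messy HTML" to a compact extraction model, a common practical step is to strip scripts, styles, and comments first so the long-context token budget is spent on actual content. The helper below is an illustrative preprocessing sketch, not the preprocessing described in the paper.

```python
import re

# Remove <script>/<style> blocks and HTML comments before tokenization.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)[^>]*>.*?</\1>",
                             re.DOTALL | re.IGNORECASE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def clean_html(html: str) -> str:
    """Drop non-content markup and collapse excess blank lines."""
    html = SCRIPT_STYLE_RE.sub("", html)
    html = COMMENT_RE.sub("", html)
    return re.sub(r"\n{3,}", "\n\n", html).strip()
```

For example, `clean_html("<style>a{}</style><p>Hi</p><!-- ad -->")` keeps only `<p>Hi</p>`. A regex-based cleaner like this is a rough heuristic; a production pipeline would more likely use a real HTML parser.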