Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for generating instruction-tuning data rely heavily on high-quality seed instructions or on strong assumptions about the structure of web sources, which limits scalability and diversity. To address this, we propose WebR (Web Reconstruction), a fully automated framework that synthesizes high-quality instruction-response pairs directly from raw, unstructured web pages, requiring no external supervision, seed data, or predefined templates. Its core innovation is a dual-perspective paradigm, “Web as Instruction” and “Web as Response,” in which each web document is designated as either an instruction or a response; it is implemented via content-driven bidirectional role modeling, unsupervised semantic alignment, and reconstruction-based generation, augmented by lightweight filtering and quality distillation. This approach enhances data diversity, domain adaptability, and scalability. Evaluated on four instruction-following benchmarks, datasets generated by WebR outperform state-of-the-art baselines by up to 16.65%, and the framework also shows strong data efficiency and domain adaptation.

📝 Abstract
The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm--Web as Instruction and Web as Response--where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.
Problem

Research questions and friction points this paper is trying to address.

Generates instruction-response pairs from raw web documents
Reduces reliance on seed data quality and structural assumptions
Improves LLM instruction-following performance by up to 16.65% over state-of-the-art baselines across four benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Web Reconstruction for IT data synthesis
Dual-perspective paradigm: Web as Instruction/Response
Enhances domain adaptation with minimal effort
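
The dual-perspective idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the `Pair` type, the `reconstruct` function, and the `llm` callable are all hypothetical names introduced here, and using the raw document unchanged as the response is a simplifying assumption. The only thing taken from the abstract is that each web document is designated as either an instruction or a response.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical prompt templates; the source does not specify any wording.
WEB_AS_INSTRUCTION_PROMPT = (
    "Below is a web document. Write a request that asks for some "
    "transformation of it, such as a summary or a rewrite.\n\n{doc}"
)
WEB_AS_RESPONSE_PROMPT = (
    "Below is a web document. Write an instruction for which this "
    "document would be a suitable response.\n\n{doc}"
)


@dataclass
class Pair:
    role: str          # "instruction" or "response": the role given to the document
    instruction: str
    response: str


def reconstruct(
    doc: str,
    llm: Callable[[str], str],
    role: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Pair:
    """Turn one raw web document into an instruction-response pair.

    `llm` is any text-in, text-out callable (an assumption of this sketch).
    If `role` is not given, it is drawn uniformly at random.
    """
    if role is None:
        role = (rng or random).choice(["instruction", "response"])

    if role == "instruction":
        # Web as Instruction: the document becomes part of the instruction,
        # and the model produces the response.
        request = llm(WEB_AS_INSTRUCTION_PROMPT.format(doc=doc))
        instruction = f"{request}\n\n{doc}"
        response = llm(instruction)
    elif role == "response":
        # Web as Response: the model infers an instruction, and the document
        # itself serves as the response (kept unrefined in this sketch).
        instruction = llm(WEB_AS_RESPONSE_PROMPT.format(doc=doc))
        response = doc
    else:
        raise ValueError(f"unknown role: {role!r}")

    return Pair(role=role, instruction=instruction, response=response)
```

Calling `reconstruct(page_text, llm=my_model)` on each page of a crawl would yield a mix of both pair types. A real pipeline would likely also need filtering and response refinement, which this sketch leaves out.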
Authors

Yuxin Jiang
The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

Yufei Wang
Huawei Noah’s Ark Lab

Chuhan Wu
WeChat AI, Tencent
Foundation Model, Pretraining, Post Training, LLM Agent

Xinyi Dai
Noah's Ark Lab, Huawei
Information Retrieval, Recommender System, Large Language Models

Yan Xu
Huawei Noah’s Ark Lab

Weinan Gan
Huawei Noah's Ark Lab
Large Language Model, Generative IR, Agent

Yasheng Wang
Tencent
Natural Language Processing

Xin Jiang
Huawei Noah’s Ark Lab

Lifeng Shang
Huawei Noah's Ark Lab
Machine Learning, Computer Vision, Pattern Recognition, Natural Language Processing

Ruiming Tang
Huawei Noah’s Ark Lab

Wei Wang
The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology