SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for extracting structured information from semi-structured web content (e.g., HTML tables, infoboxes) either generalize poorly or incur high inference overhead by running a large language model (LLM) on every page. To address this, we propose a reinforcement learning (RL)-based approach that generates reusable extraction scripts, leveraging the strong layout consistency across pages of the same website. We introduce a layout-aware reward mechanism, which we present as the first application of RL to script generation for web information extraction. Our method combines synthetic data pretraining with iterative refinement on unlabeled CommonCrawl data, substantially reducing dependence on large models. Experiments show that the generated scripts outperform strong baselines by over 13% in script quality and boost GPT-4o’s downstream question-answering accuracy by more than 4%. The approach enables efficient, generalizable, and resource-light information extraction at web scale.

📝 Abstract

Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
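To make the idea of a reusable extraction script concrete, here is a minimal sketch using only the Python standard library. It is not the script format SCRIBES produces; the names `TableRowParser`, `extract_infobox`, and the toy pages are hypothetical and invented for illustration. The point it shows is that one script, written once, can run unchanged over every page that shares a layout, with no per-page LLM call.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join fragments and normalize whitespace before storing the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_infobox(html):
    """Hypothetical reusable script: map two-cell rows to key/value pairs."""
    parser = TableRowParser()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Two toy pages sharing one layout (invented for illustration).
PAGE_TEMPLATE = (
    "<table>"
    "<tr><th>Name</th><td>{name}</td></tr>"
    "<tr><th>Founded</th><td>{year}</td></tr>"
    "</table>"
)
pages = [
    PAGE_TEMPLATE.format(name="Acme", year="1999"),
    PAGE_TEMPLATE.format(name="Globex", year="2004"),
]

# The same script is applied to every page in the group.
records = [extract_infobox(page) for page in pages]
```

The cost of producing the script is paid once per layout group, while applying it to additional pages is ordinary parsing.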

Problem

Research questions and friction points this paper is trying to address.

Extracting structured data from semi-structured web content

Reducing reliance on resource-intensive per-page LLM inference

Generating reusable scripts for scalable web information extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning with a layout-similarity reward

Generates reusable scripts for structurally similar webpages

Iteratively trains on synthetic annotations of in-the-wild CommonCrawl data
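One way to read the layout-similarity reward is that a script is scored on a whole group of structurally similar pages rather than on one page alone, so a script that only fits a single page earns less. The sketch below illustrates that reading under stated assumptions: the abstract does not specify the reward formula, so the set-overlap F1, the averaging, and the names `f1`, `group_reward`, `toy_script`, and `overfit_script` are all hypothetical.

```python
def f1(predicted, reference):
    """F1 overlap between two sets of extracted items."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def group_reward(script, pages, references):
    """Average per-page F1 of one script across a group of similar pages.

    Assumption: a script that raises on a page earns 0 for that page.
    """
    scores = []
    for page, reference in zip(pages, references):
        try:
            predicted = set(script(page))
        except Exception:
            predicted = set()
        scores.append(f1(predicted, reference))
    return sum(scores) / len(scores) if scores else 0.0


def toy_script(page):
    """Generalizing script: parses 'key=value;key=value' toy pages."""
    return [tuple(field.split("=", 1)) for field in page.split(";")]


def overfit_script(page):
    """Script that hardcodes the first page's answer and ignores its input."""
    return [("name", "Acme"), ("year", "1999")]


# Toy stand-ins for pages from one site, with reference annotations.
pages = ["name=Acme;year=1999", "name=Globex;year=2004"]
references = [
    {("name", "Acme"), ("year", "1999")},
    {("name", "Globex"), ("year", "2004")},
]

general_reward = group_reward(toy_script, pages, references)
overfit_reward = group_reward(overfit_script, pages, references)
```

Under this scoring, the generalizing script receives a higher reward than the hardcoded one, which is the behavior a group-level reward is meant to encourage.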
