How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of effective mechanisms for evaluating and improving large language models (LLMs) at generating real-world, step-by-step procedural instructions. It proposes How2Bench, a balanced benchmark built from 351K web-mined procedures, together with How2Score, a scoring protocol in which an LLM judge checks whether a generated procedure contains any critical failure that would prevent achieving the goal, enabling fully automated construction of a high-quality procedural-generation benchmark from web data. Distilling the judge into an open model that reaches 80.5% agreement with human evaluators substantially reduces annotation cost. The benchmark yields useful signal as early as pretraining, and reinforcement learning with How2Score as the reward improves How2Bench performance by more than 10 points on average across three models, without degrading general-purpose benchmarks; the gains are robust to source-document memorization and superficial format imitation.

πŸ“ Abstract
Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
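The core of How2Score, as the abstract describes it, is a binary check: an LLM judge decides whether a generated procedure contains any critical failure that would block the goal, and that verdict can double as an RL reward. A minimal sketch of such a protocol is below; the prompt wording, the `CRITICAL FAILURE` marker, and the injected `judge` callable are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a How2Score-style binary judging protocol.
# The judge is passed in as a callable so the example stays
# self-contained; in practice it would wrap a distilled judge LLM.
from typing import Callable

# Hypothetical prompt template (not the paper's actual prompt).
JUDGE_PROMPT = (
    "Goal: {goal}\n"
    "Candidate procedure:\n{procedure}\n\n"
    "Does any step contain a critical failure that would prevent "
    "achieving the goal? Answer CRITICAL FAILURE or PASS."
)

def how2score(goal: str, procedure: str,
              judge: Callable[[str], str]) -> float:
    """Return 1.0 (usable directly as a binary RL reward) iff the
    judge reports no critical failure in the candidate procedure."""
    verdict = judge(JUDGE_PROMPT.format(goal=goal, procedure=procedure))
    return 0.0 if "CRITICAL FAILURE" in verdict.upper() else 1.0

# Toy stand-in judge for illustration only: it flags candidates that
# contain no numbered steps at all as a critical failure.
def toy_judge(prompt: str) -> str:
    return "PASS" if "1." in prompt else "CRITICAL FAILURE"

reward = how2score(
    "boil an egg",
    "1. Boil water. 2. Add the egg. 3. Wait 8 minutes.",
    toy_judge,
)
```

Because the score is a single 0/1 signal per generation, it plugs into standard RL-from-reward pipelines without a separate learned reward model, which is presumably what lets the framework close the loop from evaluation to improvement.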
Problem

Research questions and friction points this paper is trying to address.

how-to procedures
procedural validity
large language models
capability evaluation
step-by-step generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural generation
web mining
LLM evaluation
reinforcement learning
distillation