How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of effective mechanisms for evaluating and improving large language models (LLMs) at generating real-world, step-by-step procedural instructions. It proposes How2Bench, a balanced benchmark built from 351K web-mined procedures, together with How2Score, a scoring protocol in which an LLM judge checks whether a generated procedure contains any critical failure that would prevent achieving the goal, enabling fully automated construction of a high-quality procedural-generation benchmark from web data. Distilling the judge into an open model that reaches 80.5% agreement with human evaluators substantially reduces annotation cost. The benchmark yields useful signal as early as pretraining, and reinforcement learning with How2Score as the reward improves How2Bench performance by more than 10 points on average across three models, without degrading general-purpose benchmarks; the gains are robust to source-document memorization and superficial format imitation.

πŸ“ Abstract
Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
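The core of How2Score, as the abstract describes it, is a binary check: an LLM judge decides whether a generated procedure contains any critical failure that would block the goal, and that verdict can double as an RL reward. A minimal sketch of such a protocol is below; the prompt wording, the `CRITICAL FAILURE` marker, and the injected `judge` callable are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a How2Score-style binary judging protocol.
# The judge is passed in as a callable so the example stays
# self-contained; in practice it would wrap a distilled judge LLM.
from typing import Callable

# Hypothetical prompt template (not the paper's actual prompt).
JUDGE_PROMPT = (
    "Goal: {goal}\n"
    "Candidate procedure:\n{procedure}\n\n"
    "Does any step contain a critical failure that would prevent "
    "achieving the goal? Answer CRITICAL FAILURE or PASS."
)

def how2score(goal: str, procedure: str,
              judge: Callable[[str], str]) -> float:
    """Return 1.0 (usable directly as a binary RL reward) iff the
    judge reports no critical failure in the candidate procedure."""
    verdict = judge(JUDGE_PROMPT.format(goal=goal, procedure=procedure))
    return 0.0 if "CRITICAL FAILURE" in verdict.upper() else 1.0

# Toy stand-in judge for illustration only: it flags candidates that
# contain no numbered steps at all as a critical failure.
def toy_judge(prompt: str) -> str:
    return "PASS" if "1." in prompt else "CRITICAL FAILURE"

reward = how2score(
    "boil an egg",
    "1. Boil water. 2. Add the egg. 3. Wait 8 minutes.",
    toy_judge,
)
```

Because the score is a single 0/1 signal per generation, it plugs into standard RL-from-reward pipelines without a separate learned reward model, which is presumably what lets the framework close the loop from evaluation to improvement.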
Problem

Research questions and friction points this paper is trying to address.

how-to procedures
procedural validity
large language models
capability evaluation
step-by-step generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural generation
web mining
LLM evaluation
reinforcement learning
distillation