IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in how large language models (LLMs) are evaluated for industrial procurement, where compliance with Chinese national standards (GB/T) and safety constraints is essential but rarely captured by mainstream benchmarks. The authors propose an evaluation framework that explicitly checks standards compliance and safety, built on a 2,049-item benchmark spanning seven capability dimensions and ten industry categories, with item-aligned English, Russian, and Vietnamese renderings and a separate safety-violation check. Using search-based external verification to filter candidate questions, Qwen3-Max as the judge (κ_w = 0.798 against a domain expert), and a dual-track assessment of correctness and safety, they evaluate 17 models in Chinese and an 8-model intersection across four languages. The best system scores only 2.083 on the 0-3 rubric, and safety violations substantially reshuffle model rankings, which points to the need for source-grounded, safety-aware evaluation in industrial applications.

📝 Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $κ_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
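The judge-validation figure quoted in the abstract, κ_w, is a weighted Cohen's kappa between the LLM judge and a domain expert. The following is a minimal sketch of how such an agreement statistic can be computed on a 0-3 rubric. It is not the authors' scoring script: the function name `weighted_kappa`, its parameters, and the sample scores are hypothetical, and the abstract does not say whether linear or quadratic weights were used, so both are offered as an assumption.

```python
from collections import Counter


def weighted_kappa(judge, expert, num_levels=4, weighting="quadratic"):
    """Cohen's weighted kappa for two raters scoring items on 0..num_levels-1.

    Hypothetical helper for illustration; the weighting scheme used in the
    paper is not stated in the abstract.
    """
    if len(judge) != len(expert) or not judge:
        raise ValueError("score lists must be non-empty and equal in length")
    n = len(judge)
    power = 2 if weighting == "quadratic" else 1
    max_dist = (num_levels - 1) ** power

    def disagreement(i, j):
        # Penalty for rating levels i and j, scaled to the range [0, 1].
        return abs(i - j) ** power / max_dist

    pair_counts = Counter(zip(judge, expert))
    judge_counts = Counter(judge)
    expert_counts = Counter(expert)

    # Observed weighted disagreement across the rated items.
    observed = sum(disagreement(i, j) * c / n for (i, j), c in pair_counts.items())
    # Weighted disagreement expected if the two raters were independent.
    expected = sum(
        disagreement(i, j) * judge_counts[i] * expert_counts[j] / (n * n)
        for i in range(num_levels)
        for j in range(num_levels)
    )
    if expected == 0:
        # Degenerate case: both raters gave the same constant score.
        # Treated as perfect agreement here by convention.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Made-up rubric scores for six items, for illustration only.
    judge_scores = [0, 1, 2, 3, 3, 2]
    expert_scores = [0, 1, 2, 3, 2, 2]
    print(weighted_kappa(judge_scores, expert_scores, weighting="quadratic"))
    print(weighted_kappa(judge_scores, expert_scores, weighting="linear"))
```

Quadratic weights penalize a 0-versus-3 disagreement far more heavily than a 2-versus-3 one, which suits an ordinal rubric where near misses matter less than large gaps.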

Problem

Research questions and friction points this paper is trying to address.

industrial procurement

safety violation

standards compliance

LLM evaluation

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

IndustryBench

safety-aware evaluation

standard-grounded QA