🤖 AI Summary
Existing LLM evaluation benchmarks suffer from data fragmentation, uneven domain coverage, and poor customizability, hindering fine-grained assessment in specialized subfields such as mathematics and programming. To address these limitations, we propose BenchHub, a dynamic, extensible, unified evaluation platform. Our method integrates (1) 303K heterogeneous questions drawn from 38 benchmarks; (2) a novel benchmark organization paradigm featuring automated classification and domain-aware curation, enabling on-demand customization and continuous updates; and (3) a data integration framework combining multi-source web crawling, semantic clustering, and metadata standardization, coupled with a modular evaluation pipeline and an API-driven, programmable assessment interface. Experiments across major LLM families show that model performance varies substantially across domain-specific subsets, improving benchmark reusability and the transparency of model comparisons. Moreover, the platform systematically surfaces underrepresented areas that aggregate benchmarks obscure, highlighting critical gaps in current evaluation practices.
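To make the "API-driven, programmable assessment interface" concrete, here is a minimal Python sketch of how such a platform could expose domain-filtered subsets and a modular evaluation step. Every name below (`BenchHub`, `Question`, `load_subset`, `evaluate`) is an illustrative assumption, not BenchHub's published API.

```python
# Hypothetical sketch of an API-driven, domain-aware evaluation interface.
# The paper describes such an interface but does not publish this exact API;
# all names and fields here are assumed for illustration.

from dataclasses import dataclass

@dataclass
class Question:
    text: str
    answer: str
    domain: str            # e.g. "math" or "code", set by automated classification
    source_benchmark: str  # standardized metadata recording the original dataset

class BenchHub:
    """Minimal stand-in for a unified, filterable benchmark repository."""

    def __init__(self, questions: list[Question]):
        self.questions = questions

    def load_subset(self, domain: str | None = None,
                    benchmark: str | None = None) -> list[Question]:
        """On-demand customization: filter the pooled questions by metadata."""
        return [q for q in self.questions
                if (domain is None or q.domain == domain)
                and (benchmark is None or q.source_benchmark == benchmark)]

def evaluate(model, subset: list[Question]) -> float:
    """Modular evaluation step: exact-match accuracy over a chosen subset."""
    correct = sum(model(q.text).strip() == q.answer for q in subset)
    return correct / len(subset) if subset else 0.0

# Usage: compare one model on two domain-specific slices of the same pool.
hub = BenchHub([
    Question("2 + 2 = ?", "4", "math", "gsm-style"),
    Question("What does len([1, 2]) return?", "2", "code", "code-qa"),
])
dummy_model = lambda prompt: "4"  # placeholder for a real LLM call
print(evaluate(dummy_model, hub.load_subset(domain="math")))  # 1.0
print(evaluate(dummy_model, hub.load_subset(domain="code")))  # 0.0
```

Filtering on standardized metadata rather than on raw dataset files is what makes per-domain scores, like the ones reported in the experiments, cheap to compute from a single question pool.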
📝 Abstract
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered and difficult to manage, making it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering critical infrastructure for advancing LLM evaluation research.
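As a rough illustration of the "automatically classifies" step, the sketch below embeds questions and clusters them into candidate domains. It assumes a generic embed-then-cluster pipeline; TF-IDF and k-means are stand-ins for whatever the paper actually uses, which the abstract does not specify.

```python
# Illustrative sketch of automated benchmark classification via semantic
# clustering. TF-IDF and k-means are assumed stand-ins; the paper's actual
# embedding model and clustering method are not given in the abstract.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "Solve for x: 3x + 5 = 20.",
    "Compute the derivative of x**2 + 3*x.",
    "Write a Python function that reverses a string.",
    "Fix the off-by-one error in this loop.",
]

# Embed questions (TF-IDF here; a sentence-embedding model is more typical).
vectors = TfidfVectorizer().fit_transform(questions)

# Semantic clustering groups questions into candidate domain buckets.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Metadata standardization: attach the inferred cluster as a domain tag.
for question, label in zip(questions, labels):
    print(f"cluster {label}: {question}")
```

In a real pipeline, the clusters would be mapped to named domains (math, code, and so on) and stored as standardized metadata, which is what enables the domain-specific subsets evaluated in the experiments.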