InfoSynth: Information-Guided Benchmark Synthesis for LLMs

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current benchmarks for large language models (LLMs) rely predominantly on manually curated datasets, which are costly to produce, susceptible to training-data contamination, and often lacking in novelty and diversity. To address these limitations, this work proposes InfoSynth, a framework that introduces information-theoretic metrics, based on KL divergence and entropy, to quantitatively measure and controllably modulate the novelty, diversity, and difficulty of generated questions. By integrating genetic algorithms, an iterative code-feedback mechanism, and information-theoretic evaluation, InfoSynth establishes an end-to-end pipeline for automatically generating and validating Python programming problems. Experiments show that 97% of the generated problems are accompanied by correct test cases and reference solutions, and that the synthesized problems exhibit significantly higher novelty and diversity than the seed dataset.

📝 Abstract
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
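The abstract's KL-divergence and entropy metrics can be illustrated with a minimal sketch. The paper does not publish its exact formulation here, so the following is an assumed setup: each benchmark is summarized as an empirical distribution over topic clusters (e.g., from k-means on problem embeddings), diversity is the entropy of that distribution, and novelty is the KL divergence of the generated set's distribution from the seed set's. The function names and the clustering step are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.stats import entropy

def cluster_distribution(cluster_ids, n_clusters):
    """Empirical distribution of problems over topic clusters
    (cluster ids could come from k-means on problem embeddings)."""
    counts = np.bincount(cluster_ids, minlength=n_clusters).astype(float)
    counts += 1e-9  # smooth so the KL term is finite on empty clusters
    return counts / counts.sum()

def diversity(dist):
    """Shannon entropy: higher means problems spread over more topics."""
    return entropy(dist)

def novelty(gen_dist, seed_dist):
    """KL(generated || seed): higher means the generated set
    diverges more from the seed distribution."""
    return entropy(gen_dist, seed_dist)

# Toy example with 5 topic clusters
seed = cluster_distribution(np.array([0, 0, 0, 1, 1, 2]), 5)
gen = cluster_distribution(np.array([0, 1, 2, 3, 3, 4]), 5)

print(diversity(gen) > diversity(seed))  # True: wider topic coverage
print(novelty(gen, seed) > 0.0)          # True: distributions differ
```

A metric of this shape matches the abstract's claim that novelty and diversity can be scored "without relying on costly model evaluations": both quantities need only embeddings of the problem texts, not model runs on the problems.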
Problem

Research questions and friction points this paper is trying to address.

benchmark synthesis
large language models
novelty
diversity
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

information-theoretic metrics
benchmark synthesis
genetic algorithms
LLM evaluation
code generation
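The genetic-algorithm and code-feedback components listed above can be sketched as a simple evolve-and-verify loop. This is a hedged illustration only: `mutate`, `crossover`, and `validate` are hypothetical stand-ins for LLM calls (rewrite a problem, combine two problems, run a proposed solution against generated tests with failure feedback), and the 0.97 acceptance probability merely echoes the paper's reported 97% validation rate rather than modeling its mechanism.

```python
import random

def mutate(problem):
    # stand-in for an LLM prompt such as "make this problem harder"
    return problem + " (variant)"

def crossover(a, b):
    # stand-in for an LLM prompt combining constraints of two problems
    return a.split()[0] + " + " + b

def validate(problem):
    # stand-in for executing a proposed solution against generated
    # test cases, with failures fed back to the LLM for repair
    return random.random() < 0.97  # echoes the paper's 97% rate

def evolve(seed_problems, generations=3, pop_size=8):
    """Genetic-style loop: recombine, mutate, keep only problems
    that pass self-verification."""
    population = list(seed_problems)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            a, b = random.sample(population, 2)
            child = mutate(crossover(a, b))
            if validate(child):  # only self-verified problems survive
                children.append(child)
        population = children
    return population

bench = evolve(["reverse a string", "sum a list", "find primes"])
print(len(bench))  # 8
```

In the paper's actual pipeline, selection pressure would also come from the information-theoretic scores (novelty, diversity, difficulty), which this sketch omits; here survival depends only on passing validation.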