🤖 AI Summary
Inference-time techniques such as verification, critique, and unit testing lack systematic guidance for selection and composition: how they interact is poorly understood, and the design space grows combinatorially. This paper introduces Archon, which it presents as the first architecture search framework tailored to inference-time techniques. It formulates LLM system construction as a hyperparameter optimization problem over a modular, extensible design space in which techniques are stacked in layers. The framework integrates Bayesian optimization, multi-model collaborative scheduling, and diverse composition strategies, including fusion, ranking, verification, and critique. Evaluated on instruction-following, reasoning, and coding benchmarks (MT-Bench, MATH, CodeContests, and others), Archon architectures achieve an average accuracy improvement of 15.1 percentage points and outperform GPT-4o and Claude 3.5 Sonnet.
📝 Abstract
Inference-time techniques are emerging as highly effective tools to enhance large language model (LLM) capabilities. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of individual inference-time techniques and the interactions between them. Additionally, efficiently and automatically searching the space of model choices, inference-time techniques, and their compositions is challenging due to the large design space. To address these challenges, we introduce Archon, a modular framework for selecting, combining, and stacking layers of inference-time techniques to construct optimized LLM systems for target benchmarks. Rather than relying on a single LLM called once, we leverage a diverse set of LLMs and inference-time techniques, creating LLM systems greater than the sum of their parts. Archon defines an extensible design space, encompassing techniques such as generation ensembling, repeated sampling, ranking, fusion, critiquing, verification, and unit testing. It transforms the problem of building LLM systems into a hyperparameter optimization objective. Given the available LLMs, inference-time techniques, and compute budget, Archon utilizes hyperparameter search techniques to discover optimized architectures for target benchmark(s). We evaluate Archon architectures across a range of instruction-following, reasoning, and coding benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. Archon architectures outperform frontier models, such as GPT-4o and Claude 3.5 Sonnet, on these benchmarks, achieving an average accuracy increase of 15.1 percentage points by using all available LLMs. We make our code and datasets publicly available on GitHub: https://github.com/ScalingIntelligence/Archon.
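The idea of stacking inference-time layers and searching over their configuration under a compute budget can be illustrated with a small toy sketch. This is not Archon's code or API: the stub models, the `generate`/`rank`/`fuse` functions, the `toy_score` metric, and the exhaustive grid search (standing in for whatever hyperparameter search the framework actually uses) are all hypothetical, chosen only so the example runs without any LLM access.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for LLM calls: each "model" maps a prompt to an answer.
Model = Callable[[str], str]

def make_stub_model(suffix: str) -> Model:
    return lambda prompt: f"{prompt} {suffix}"

MODELS: Dict[str, Model] = {
    "short": make_stub_model("ok"),
    "medium": make_stub_model("a fuller answer"),
    "long": make_stub_model("a much longer and more detailed answer"),
}

def generate(prompt: str, model_names: Tuple[str, ...], samples: int) -> List[str]:
    """Generation ensembling plus repeated sampling: call each model `samples` times."""
    return [MODELS[name](prompt) for name in model_names for _ in range(samples)]

def rank(candidates: List[str], top_k: int) -> List[str]:
    """Toy ranker: keep the top_k longest candidates (a real ranker would use a model)."""
    return sorted(candidates, key=len, reverse=True)[:top_k]

def fuse(candidates: List[str]) -> str:
    """Toy fuser: merge the distinct words of the candidates in first-seen order."""
    seen, merged = set(), []
    for candidate in candidates:
        for word in candidate.split():
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return " ".join(merged)

def run_architecture(prompt: str, config: dict) -> str:
    """Stack the layers: generate -> rank -> fuse."""
    candidates = generate(prompt, config["models"], config["samples"])
    return fuse(rank(candidates, config["top_k"]))

def cost(config: dict) -> int:
    """Compute cost, measured here as the number of model calls."""
    return len(config["models"]) * config["samples"]

def toy_score(answer: str) -> int:
    """Placeholder benchmark metric (word count); purely illustrative."""
    return len(answer.split())

def search(prompt: str, budget: int) -> Optional[Tuple[int, dict]]:
    """Exhaustive grid search over configurations that fit within the budget."""
    model_sets = [("short",), ("short", "medium"), ("short", "medium", "long")]
    best = None
    for models, samples, top_k in product(model_sets, (1, 2), (1, 2)):
        config = {"models": models, "samples": samples, "top_k": top_k}
        if cost(config) > budget:
            continue  # skip architectures that exceed the compute budget
        score = toy_score(run_architecture(prompt, config))
        if best is None or score > best[0]:
            best = (score, config)
    return best
```

Calling `search("Q", budget=3)` returns the best-scoring configuration whose number of model calls is at most 3. In a realistic setting the stub models would be replaced by actual LLM calls, the score by a benchmark metric, and the grid by a more sample-efficient search strategy.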