🤖 AI Summary
This work addresses the limited performance of large language models (LLMs) on high-dimensional software engineering optimization tasks, where they often fail to surpass Bayesian optimization. For the first time, it systematically compares human- and AI-generated domain-knowledge injection strategies and introduces four novel architectures: Human-in-the-Loop Domain Knowledge Prompting (H-DKP), Adaptive Multi-Stage Prompting (AMP), Dimension-Aware Progressive Refinement (DAPR), and the Hybrid Knowledge-Model Approach (HKMA), which combines statistical scouting with RAG-enhanced knowledge integration. By leveraging multi-stage, dimension-aware, and hybrid knowledge-fusion strategies, the proposed methods incorporate structured domain knowledge to significantly enhance LLMs' ability to generate high-quality initial solutions. Evaluated on the high-dimensional MOOT benchmark, the approaches markedly reduce the Chebyshev distance to the optimal solution and, under Scott-Knott clustering, outperform existing LLM warm-start baselines.
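The subspace-expansion idea behind DAPR can be sketched as a toy random search that optimizes a growing prefix of the feature vector while holding the remaining coordinates fixed. Everything below (the objective, the `(4, 8, 16)` stage schedule, and the uniform sampler) is our own illustrative assumption, not the paper's implementation:

```python
import random

def toy_objective(x):
    # Illustrative loss with a known optimum at all coordinates = 0.5
    # (stand-in for a real software-configuration objective).
    return sum((v - 0.5) ** 2 for v in x)

def progressive_refinement(dim=16, stages=(4, 8, 16), samples_per_stage=200, seed=0):
    """Toy sketch of dimension-aware progressive refinement: random-search
    only the first k coordinates at each stage, keeping the remaining
    coordinates at their current best values, then expand k."""
    rng = random.Random(seed)
    best = [0.0] * dim
    best_loss = toy_objective(best)
    for k in stages:
        k = min(k, dim)
        for _ in range(samples_per_stage):
            cand = best[:]
            for i in range(k):           # perturb only the active subspace
                cand[i] = rng.random()
            loss = toy_objective(cand)
            if loss < best_loss:
                best, best_loss = cand, loss
    return best, best_loss

_, final_loss = progressive_refinement()
print(final_loss < toy_objective([0.0] * 16))  # True: refined start beats the naive one
```

A real DAPR stage would presumably rank features by importance before deciding which subspace to expand into; the fixed stage schedule here is only meant to show the control flow.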
📝 Abstract
Background/Context: Large Language Models (LLMs) demonstrate strong performance on low-dimensional software engineering optimization tasks ($\le$11 features) but consistently underperform on high-dimensional problems, where Bayesian methods dominate. A fundamental gap exists in understanding how systematic integration of domain knowledge (whether from humans or automated reasoning) can bridge this divide. Objective/Aim: We compare human versus artificial-intelligence strategies for generating domain knowledge, and we systematically evaluate four distinct architectures to determine whether structured knowledge integration enables LLMs to generate effective warm starts for high-dimensional optimization. Method: We evaluate four approaches on MOOT datasets stratified by dimensionality: (1) Human-in-the-Loop Domain Knowledge Prompting (H-DKP), utilizing asynchronous expert feedback loops; (2) Adaptive Multi-Stage Prompting (AMP), implementing sequential constraint identification and validation; (3) Dimension-Aware Progressive Refinement (DAPR), conducting optimization in progressively expanding feature subspaces; and (4) the Hybrid Knowledge-Model Approach (HKMA), synthesizing statistical scouting via the Tree-structured Parzen Estimator (TPE) with RAG-enhanced prompting. Performance is quantified via Chebyshev distance to optimal solutions and ranked with Scott-Knott clustering against an established baseline for LLM-generated warm starts. All human studies conducted as part of this work will comply with the policies of our local Institutional Review Board.
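The evaluation metric named above, Chebyshev distance, is simply the largest per-coordinate gap between a candidate and the optimum; a minimal sketch (the vectors below are illustrative placeholders, not data from the paper):

```python
def chebyshev_distance(candidate, optimum):
    """Chebyshev (L-infinity) distance: the largest per-coordinate gap
    between a candidate solution and the known optimal solution."""
    return max(abs(c - o) for c, o in zip(candidate, optimum))

# Hypothetical normalized objective vectors for a warm-start candidate
# and the known optimum (values are illustrative only).
warm_start = [0.5, 0.25, 0.75]
optimum = [0.25, 0.25, 0.5]
print(chebyshev_distance(warm_start, optimum))  # -> 0.25
```

Because the metric is a max rather than a sum, a warm start is only as good as its worst objective, which makes it a strict yardstick for high-dimensional problems.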