🤖 AI Summary
This work addresses the optimization of large language model agent skills under practical platform constraints, such as field truncation and limited context windows. These constraints turn skill optimization into a multi-objective problem that trades task performance against hard operational limits, with Pareto-optimal solutions that can lie in non-convex regions of the objective space. The authors propose MOCHA (Multi-Objective Chebyshev Annealing), which combines Chebyshev scalarization with exponential annealing for selection; in the experiments, all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback. The approach moves smoothly from exploration to exploitation and can reach solutions in non-convex regions of the Pareto front. Across six tasks, it achieves a 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER) and discovers more than twice as many Pareto-optimal skill variants, overcoming the limitations of conventional weighted-sum strategies.
📝 Abstract
LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering more than twice as many Pareto-optimal skill variants.
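To make the contrast between weighted-sum and Chebyshev selection concrete, the following is a minimal sketch and not the authors' implementation. The three candidates, their objective values, the ideal point, and the helper names (`weighted_sum`, `chebyshev`, `temperature`, `select`) are all hypothetical. The abstract does not say what MOCHA anneals, so the Boltzmann-style selection with an exponentially decaying temperature is only one plausible reading.

```python
import math
import random


def weighted_sum(objs, weights):
    """Linear scalarization of objectives that are all minimized."""
    return sum(w * f for w, f in zip(weights, objs))


def chebyshev(objs, weights, ideal):
    """Weighted Chebyshev scalarization: worst weighted gap to the ideal point."""
    return max(w * abs(f - z) for w, f, z in zip(weights, objs, ideal))


# Hypothetical skill variants scored as (error, length_cost), both minimized.
# B is Pareto-optimal but lies above the line joining A and C, i.e. in a
# non-convex region of the front.
candidates = {"A": (0.0, 1.0), "B": (0.6, 0.6), "C": (1.0, 0.0)}
ideal = (0.0, 0.0)  # assumed ideal point: best value of each objective

# Sweep the trade-off weight and record which candidate each rule selects.
grid = [i / 100 for i in range(101)]
ws_winners = {
    min(candidates, key=lambda n: weighted_sum(candidates[n], (w, 1 - w)))
    for w in grid
}
ch_winners = {
    min(candidates, key=lambda n: chebyshev(candidates[n], (w, 1 - w), ideal))
    for w in grid
}


def temperature(step, t0=1.0, decay=0.9):
    """Exponential annealing schedule (assumed form): T_t = t0 * decay**step."""
    return t0 * decay ** step


def select(scores, temp, rng):
    """Boltzmann selection over scores to minimize; lower temp is greedier."""
    names = list(scores)
    best = min(scores.values())
    weights = [math.exp(-(scores[n] - best) / temp) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    print("weighted-sum winners:", sorted(ws_winners))
    print("Chebyshev winners:   ", sorted(ch_winners))
    rng = random.Random(0)
    scores = {n: chebyshev(v, (0.5, 0.5), ideal) for n, v in candidates.items()}
    for step in (0, 20, 200):
        print(step, select(scores, temperature(step), rng))
```

In this toy setup no weight makes the weighted sum prefer B, because either A or C always scores at most 0.5 while B scores 0.6. Chebyshev scalarization selects B for balanced weights. At high temperature the selection samples broadly across candidates; as the temperature decays it concentrates on the best scalarized score.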