🤖 AI Summary
This work addresses a limitation of existing automatic prompt optimization methods: they treat each prompt as a monolithic string, so they cannot model reusable sub-behaviors, which makes updates brittle and adapts poorly across inputs. The authors propose Prompt Codebooks (PCO), a compositional prompt optimization framework built on a discrete codebook. PCO recasts prompt construction as selecting and composing entries from a finite set of natural-language “instinct” units. An LLM-based encoder routes each input to a small subset of codebook entries, a generator composes them into a prompt for the frozen target model, and a critic's structured verdict is decomposed into per-variable textual gradients that jointly train the encoder, generator, and codebook under a language-valued min-max objective, so routing is specific to each instance. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to 30.36 points and surpasses the strongest prior baseline, GEPA, by 3.34 points on HotpotQA and 1.11 points in aggregate, while reducing deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only 16 instincts.
📝 Abstract
Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts: atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.
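To make the per-instance routing idea concrete, the following is a minimal sketch of the inference path only: a codebook of instincts, a router that picks a small subset per input, and a composer that assembles the prompt. It is not the authors' implementation. The four instincts, the names `CODEBOOK`, `route`, and `compose`, and the `top_m` parameter are all hypothetical, and a simple word-overlap score stands in for the LLM-based encoder and generator. The critic and the textual-gradient training loop are omitted.

```python
import re
from typing import List, Set

# Hypothetical four-entry codebook for illustration; the paper reports
# K=16 learned instincts, whose actual wording is not given here.
CODEBOOK: List[str] = [
    "Show each arithmetic step before the final number.",
    "Cite the supporting passage for every claim.",
    "Answer with a single short phrase.",
    "Check units when you compute a number.",
]


def _words(text: str) -> Set[str]:
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(query: str, codebook: List[str], top_m: int = 2) -> List[int]:
    """Toy stand-in for the encoder: pick the top_m instincts per input.

    Assumption: instincts are scored by word overlap with the query.
    The paper uses an LLM-based encoder instead.
    """
    query_words = _words(query)
    scores = [len(query_words & _words(instinct)) for instinct in codebook]
    # Sort by descending score; break ties by lower codebook index.
    ranked = sorted(range(len(codebook)), key=lambda i: (-scores[i], i))
    return ranked[:top_m]


def compose(query: str, codebook: List[str], indices: List[int]) -> str:
    """Toy stand-in for the generator: join the selected instincts."""
    bullet_lines = "\n".join(f"- {codebook[i]}" for i in indices)
    return f"Follow these instructions:\n{bullet_lines}\n\nInput: {query}"


if __name__ == "__main__":
    for q in ("compute the final number of apples",
              "which passage supports the claim"):
        selected = route(q, CODEBOOK)
        print(selected)
        print(compose(q, CODEBOOK, selected))
        print()
```

The two example inputs select different instinct subsets from the same codebook, which is the property that an instance-blind prompt cannot express: there, every input would receive the same fixed string.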