🤖 AI Summary
Existing LLM workflow optimization methods rely solely on binary success/failure signals, discarding fine-grained failure-mode information and hindering accurate modeling and optimization of the failure distribution. This work reframes the problem from a distributional perspective, introducing Expected Failure Mass (EFM) as the optimization objective and shifting the paradigm from scalar scoring to geometric reshaping of the failure distribution. To realize this, we construct a Failure Signature Space (FSS), estimate failure density from an adversarial counterexample pool, and propose the CE-Graph framework, which performs operator-constrained graph editing in high-density failure regions via a Propose-and-Verify mechanism. Evaluated on mathematical reasoning, code generation, and question-answering benchmarks, our method significantly outperforms strong baselines at lower computational cost, empirically validating that systematic distribution-level optimization improves workflow robustness.
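In symbols, the objective described in the abstract can be written as follows. This is a minimal sketch of the stated definition; the notation ($W$ for a workflow, $s$ for a failure signature, $\rho_W$ for the failure density) is ours, not the paper's:

```latex
% Expected Failure Mass of a workflow W: the total mass that the failure
% density rho_W places over the Failure Signature Space (FSS).
\mathrm{EFM}(W) = \int_{\mathcal{S}_{\mathrm{FSS}}} \rho_W(s)\,\mathrm{d}s,
\qquad W^{\star} = \arg\min_{W} \mathrm{EFM}(W)
```

One reading consistent with the abstract: $\rho_W$ is an unnormalized density whose total mass tracks the workflow's overall failure probability, so lowering the density in its most concentrated regions directly lowers the failure rate.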
📝 Abstract
Optimizing LLM-based workflows is typically formulated as a global search, where candidate workflows are evaluated by a scalar metric. This paradigm, however, suffers from a critical flaw: information collapse. By reducing rich, multi-step execution traces to simple success/failure signals, existing methods are rendered blind to the underlying structure of failures, fundamentally preventing them from modeling the workflow's failure distribution. We reconceptualize this challenge as a distributional problem. We propose a new paradigm where the optimization goal is not to maximize a scalar score, but to directly minimize a workflow's Expected Failure Mass, i.e., the integral of its failure probability density function defined over a high-dimensional Failure Signature Space (FSS). This distributional lens allows us to move from inefficient, zero-order optimization to a principled, gradient-like descent on the failure landscape itself. We introduce CE-Graph, a framework that operationalizes this paradigm through a novel, failure-driven refinement process. CE-Graph approximates the failure distribution from a pool of counterexamples, identifies its densest regions as recurring failure modes, and applies targeted, operator-constrained graph edits via a Propose-and-Verify mechanism to greedily reduce the failure mass. On math, code, and QA benchmarks, CE-Graph achieves higher robustness at a significantly lower cost than strong baselines. This suggests that a system's reliability emerges not from avoiding failures, but from systematically learning and reshaping the geometric structure of its failure distributions.
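The abstract describes the refinement loop only in prose; the sketch below shows one plausible instantiation under our own assumptions. Everything here is illustrative rather than the paper's actual method: kernel density estimation stands in for the unspecified density approximation, and `pool`, `operators`, `signature`, and `evaluate` are hypothetical names, not CE-Graph's API.

```python
# Illustrative sketch of a CE-Graph-style refinement step (our assumptions,
# not the paper's implementation). Counterexamples are assumed to carry a
# `.signature` embedding in the Failure Signature Space (FSS).
import numpy as np
from scipy.stats import gaussian_kde


def estimate_failure_density(signatures: np.ndarray) -> gaussian_kde:
    """Approximate the failure distribution over the FSS from a pool of
    counterexample embeddings (one row per counterexample)."""
    return gaussian_kde(signatures.T)  # gaussian_kde expects (dims, n_samples)


def densest_failure_mode(signatures: np.ndarray, kde: gaussian_kde) -> np.ndarray:
    """Return the counterexample in the highest-density region, treated as
    a representative of a recurring failure mode."""
    density = kde(signatures.T)
    return signatures[np.argmax(density)]


def ce_graph_step(workflow, pool, operators, evaluate):
    """One greedy Propose-and-Verify step: propose operator-constrained
    graph edits targeting the densest failure mode, then keep the edit
    that most reduces the empirical failure mass on the pool."""
    sigs = np.stack([c.signature for c in pool])
    kde = estimate_failure_density(sigs)
    mode = densest_failure_mode(sigs, kde)

    best, best_mass = workflow, evaluate(workflow, pool)
    for op in operators:                 # Propose: operator-constrained edits
        candidate = op(workflow, mode)   # targeted edit near the failure mode
        mass = evaluate(candidate, pool) # Verify: re-score on counterexamples
        if mass < best_mass:             # greedy acceptance
            best, best_mass = candidate, mass
    return best
```

The greedy acceptance rule mirrors the abstract's claim that edits are kept only when they reduce the failure mass, so the counterexample pool acts as both the density estimate's training data and the verification set.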