🤖 AI Summary
This work addresses the challenge of deploying small language models (SLMs) in enterprise settings, where their inability to self-correct reasoning errors limits performance, while large language models remain impractical due to cost and data sovereignty concerns. The authors propose Semantic Gradient Descent (SGDe), a teacher-student framework that enables gradient-like optimization in discrete semantic spaces. SGDe compiles agent workflows into executable plans comprising DAG topology, system prompts, and deterministic code, then uses natural-language critiques generated by a large teacher model as “semantic gradients” to iteratively refine the SLM's workflow artifacts through capability offloading and structural consensus mechanisms. Theoretical analysis under the PAC learning framework establishes sample-complexity bounds under which convergence is possible with as few as three training examples on targeted synthetic tasks. Experiments on a GSM-Hard–derived test set show that SGDe achieves 91.3% accuracy with five training examples and 99.3% with three, outperforming existing prompt optimization methods by 26.3–34.3 percentage points.
📝 Abstract
Enterprise deployment of small language models (SLMs) is constrained by epistemic asymmetry: SLMs cannot self-correct reasoning errors, while frontier LLMs are prohibitively costly and face data sovereignty limits for high-volume use. We propose Semantic Gradient Descent (SGDe), a teacher-student framework that compiles agentic workflows into discrete execution plans comprising DAG topologies, system prompts, and deterministic executable code. The trailing "e" distinguishes SGDe from stochastic gradient descent. SGDe operates in a discrete semantic space where a frontier teacher generates natural-language critiques acting as directional gradients to iteratively refine the SLM's workflow artefacts. We formalise SGDe within a PAC learning framework, establishing sample-complexity bounds that enable convergence with as few as three training examples on targeted synthetic tasks by leveraging the teacher as a statistical prior. On a GSM-Hard-derived test set built via adversarial synthesis, compiled workflows reach 91.3% accuracy at m=5 and 99.3% at m=3, within the small-m regime motivated by Corollary 1. This is an absolute improvement of 26.3 to 34.3 percentage points over state-of-the-art prompt optimisers. In the emerging paradigm of harness engineering, SGDe treats the placement of deterministic code (which subtasks to delegate to a Python runtime and which to retain as LLM calls) as a trace-driven, per-node optimisation target, generalising the whole-problem offloading of PAL and PoT. The teacher compiles two complementary deterministic structures: capability offloading, which delegates subtasks to Python when the SLM cannot execute them reliably, and structural consensus, which wraps variance-limited reasoning steps in fan-out/fan-in subgraphs aggregated by deterministic voting.
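The two deterministic structures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the plurality-vote rule with first-seen tie-breaking, and the stand-in `noisy_step` (a deterministic stub in place of a real SLM call) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Fan-in: deterministic plurality vote; ties go to the earliest answer."""
    if not answers:
        raise ValueError("no answers to aggregate")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in input order so the result does not depend on dict ordering.
    return next(a for a in answers if counts[a] == best)


def consensus_node(step: Callable[[str, int], str], prompt: str, k: int = 5) -> str:
    """Structural consensus: fan out k runs of one step, then vote."""
    return majority_vote([step(prompt, i) for i in range(k)])


def offloaded_product(a: int, b: int) -> int:
    """Capability offloading: exact arithmetic in Python, not an LLM call."""
    return a * b


def noisy_step(prompt: str, sample_index: int) -> str:
    """Hypothetical stand-in for an SLM call that is wrong on some samples."""
    return "42" if sample_index % 3 else "41"


if __name__ == "__main__":
    print(consensus_node(noisy_step, "What is 6 * 7?", k=5))
    print(offloaded_product(123456789, 987654321))
```

In this sketch the voting node absorbs sample-to-sample variance in a single reasoning step, while the offloaded function removes the model from a subtask it cannot execute reliably. Which nodes receive which treatment is what the abstract describes the teacher as deciding per node from execution traces.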