Once Upon an Input: Reasoning via Per-Instance Program Synthesis

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) still suffer from high error rates and poor stability in complex, multi-step reasoning, particularly on algorithmic tasks. To address this, the paper proposes Per-Instance Program Synthesis (PIPS), a method that performs iterative reasoning refinement without task-specific prompting by jointly handling program generation, structured feedback, and dynamic confidence estimation. Its core idea is instance-level program synthesis and verification: for each input, PIPS generates a candidate program, validates it with syntactic and semantic structural feedback to suppress erroneous outputs, and adaptively triggers retries based on estimated confidence. Evaluated across three frontier LLMs and 30 benchmarks, PIPS improves absolute harmonic mean accuracy by up to 9.4 percentage points over Chain of Thought (CoT) and 8.6 over Program of Thought (PoT); on algorithmic tasks it reduces undesirable program generations by 65.1% compared to PoT with Gemini-2.0-Flash. These results indicate substantial gains in both reliability and generalization for multi-step reasoning.
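The generate-validate-retry loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_program` and `estimate_confidence` are hypothetical stand-ins for model calls, and the "structural feedback" here is reduced to a syntactic compile check (the paper also uses semantic structural feedback).

```python
def structural_feedback(program: str) -> list[str]:
    """Collect syntactic feedback by trying to compile the candidate
    program (a stand-in for PIPS's richer structural checks)."""
    issues = []
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError as e:
        issues.append(f"syntax error: {e.msg}")
    return issues

def pips_solve(instance, generate_program, estimate_confidence,
               max_retries=3, threshold=0.5):
    """Generate a per-instance program, feed structural feedback back
    into regeneration, and retry while confidence stays low.
    Returns the best (program, confidence) pair seen."""
    feedback: list[str] = []
    best = None
    for _ in range(max_retries):
        program = generate_program(instance, feedback)
        feedback = structural_feedback(program)
        # Invalid programs get zero confidence and trigger a retry.
        conf = 0.0 if feedback else estimate_confidence(instance, program)
        if best is None or conf > best[1]:
            best = (program, conf)
        if conf >= threshold:  # confident enough: stop refining
            break
    return best
```

The feedback list returned by the validator is passed back into the next generation call, which is what makes the refinement per-instance rather than relying on task-level prompts or explicit test cases.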

📝 Abstract
Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks, including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks, show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
Problem

Research questions and friction points this paper is trying to address.

Improving multi-step reasoning accuracy in large language models
Reducing undesirable program generations in algorithmic tasks
Dynamically selecting inference methods using confidence metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates and refines per-instance programs using structural feedback
Dynamically chooses inference method via confidence metric
Reduces undesirable program generations in algorithmic tasks
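The dynamic choice between program synthesis and direct inference listed above can be illustrated with a small dispatch routine. Everything here is an assumption for illustration: the `solve` entry-point name, the confidence threshold, and the `solve_directly` fallback are hypothetical, not the paper's interface.

```python
def answer_with_fallback(instance, program, confidence,
                         solve_directly, threshold=0.7):
    """Run the synthesized per-instance program only when the estimated
    confidence is high and execution succeeds; otherwise fall back to
    direct (CoT-style) inference on the same instance."""
    if confidence >= threshold:
        scope: dict = {}
        try:
            exec(program, scope)             # load the synthesized program
            return scope["solve"](instance)  # assumed entry point `solve`
        except Exception:
            pass                             # execution failed: fall back
    return solve_directly(instance)
```

Routing per instance rather than per task is the point: the same model can answer one input directly and synthesize a program for the next, depending on where its confidence lies.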