🤖 AI Summary
This study addresses the limited reasoning performance of large language models (LLMs) on tabular fact verification and the desire to avoid costly fine-tuning by systematically evaluating instruction optimization methods within the DSPy framework. It presents the first comprehensive analysis of how combining four prompting strategies (Direct Prediction, Chain-of-Thought, ReAct, and CodeAct) with three optimizers (COPRO, MiPROv2, and SIMBA) affects verification accuracy, reasoning trajectories, and tool-invocation behavior. Experimental results show that instruction optimization substantially improves performance: MiPROv2 yields the most stable gains with Chain-of-Thought, while SIMBA achieves the best results with ReAct, particularly on larger models; notably, Chain-of-Thought proves especially effective for smaller models.
📝 Abstract
Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization methods, built on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that span both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three DSPy optimizers, COPRO, MiPROv2, and SIMBA, across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping ReAct agents avoid unnecessary tool calls. Across prompting techniques, CoT remains effective for tabular fact verification, especially with smaller models. Although ReAct agents built on larger models can achieve competitive performance, they require careful instruction optimization.
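To make the setup concrete: DSPy optimizers like MiPROv2 and SIMBA compile a program against a task metric, here verification accuracy. The sketch below shows a minimal accuracy metric of the kind such an optimizer would maximize; the function names (`normalize_label`, `accuracy_metric`) and the label normalization rules are illustrative assumptions, not details from the paper.

```python
def normalize_label(text: str) -> str:
    """Map free-form model output to a canonical verification label.
    Crude illustrative heuristic, not the paper's actual implementation."""
    t = text.strip().lower()
    if "refut" in t or t in {"false", "no"}:
        return "REFUTED"
    if "support" in t or t in {"true", "yes"}:
        return "SUPPORTED"
    return "UNKNOWN"


def accuracy_metric(gold_label: str, predicted_text: str) -> bool:
    """Score one example: True iff the normalized prediction matches the gold label.
    An optimizer searches over instructions/demonstrations to maximize the
    mean of this metric over a training set."""
    return normalize_label(predicted_text) == normalize_label(gold_label)
```

In a DSPy pipeline, a metric with this shape would be passed to the optimizer, which then proposes and evaluates candidate instructions for the prompting module (direct prediction, CoT, ReAct, or CodeAct) under that score.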