Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

πŸ“… 2026-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study challenges the prevailing assumption that performance gains in multi-LLM revision pipelines primarily stem from error correction, offering the first systematic disentanglement of their sources of improvement. Through four controlled experiments on knowledge-based multiple-choice questions and code generation tasks, the authors decompose second-round benefits into three components (re-solving, scaffolding, and content contribution) and examine how their relative impacts vary across model pairings. The findings reveal that in multiple-choice tasks, gains are predominantly driven by strong models’ ability to re-solve problems, making direct invocation of the strong model more efficient; in contrast, for code generation, draft structure provides useful scaffolding, yet low-quality draft content actively harms performance. These results underscore the critical moderating roles of task type and draft quality in determining the efficacy of revision pipelines.
πŸ“ Abstract
Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
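The abstract describes four matched conditions that split the total second-pass gain into three additive components: re-solving, scaffold, and content. One plausible reading of that design can be sketched as below; the condition names and all accuracy numbers are hypothetical illustrations, not values from the paper.

```python
# Hedged sketch of the additive decomposition described in the abstract.
# The four conditions assumed here (a plausible reconstruction, not the
# paper's exact protocol):
#   weak_alone             : weak model answers directly
#   strong_alone           : strong model answers directly (re-solving)
#   strong_with_null_draft : strong model revises a semantically null draft
#                            (isolates structural scaffolding)
#   strong_with_draft      : strong model revises the weak model's real draft
#                            (adds the draft's actual content)

def decompose_second_pass_gain(weak_alone, strong_alone,
                               strong_with_null_draft, strong_with_draft):
    """Split the total gain over the weak model's solo accuracy into
    three additive components: re-solving, scaffold, and content."""
    re_solving = strong_alone - weak_alone
    scaffold = strong_with_null_draft - strong_alone
    content = strong_with_draft - strong_with_null_draft
    total = strong_with_draft - weak_alone
    # Sanity check: the three components sum to the total gain by construction.
    assert abs((re_solving + scaffold + content) - total) < 1e-9
    return {"re_solving": re_solving, "scaffold": scaffold, "content": content}

# Hypothetical accuracies for a code-generation task in which draft
# structure helps but weak draft content hurts, the pattern the paper reports:
parts = decompose_second_pass_gain(
    weak_alone=0.40, strong_alone=0.62,
    strong_with_null_draft=0.70, strong_with_draft=0.66)
```

Under these illustrative numbers, the scaffold term is positive while the content term is negative, matching the abstract's claim that even semantically null drafts can help on code generation while weak draft content can be harmful.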
Problem

Research questions and friction points this paper is trying to address.

multi-LLM revision
error correction
task structure
draft quality
pipeline design
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-LLM revision
re-solving
scaffolding
controlled decomposition
pipeline design
Jingjie Ning
School of Computer Science, Carnegie Mellon University
Xueqi Li
Shenzhen University
Chengyu Yu
School of Computer Science, Carnegie Mellon University