Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

πŸ“… 2026-04-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study challenges the prevailing assumption that performance gains in multi-LLM revision pipelines primarily stem from error correction, offering the first systematic disentanglement of their sources of improvement. Through four controlled experiments on knowledge-based multiple-choice questions and code generation tasks, the authors decompose second-round benefits into three components (re-solving, scaffolding, and content contribution) and examine how their relative impacts vary across model pairings. The findings reveal that in multiple-choice tasks, gains are predominantly driven by strong models’ ability to re-solve problems, making direct invocation of the strong model more efficient; in contrast, for code generation, draft structure provides useful scaffolding, yet low-quality draft content actively harms performance. These results underscore the critical moderating roles of task type and draft quality in determining the efficacy of revision pipelines.
πŸ“ Abstract
Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
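The abstract describes four matched conditions that split the total second-pass gain into three additive components: re-solving, scaffold, and content. One plausible reading of that design can be sketched as below; the condition names and all accuracy numbers are hypothetical illustrations, not values from the paper.

```python
# Hedged sketch of the additive decomposition described in the abstract.
# The four conditions assumed here (a plausible reconstruction, not the
# paper's exact protocol):
#   weak_alone             : weak model answers directly
#   strong_alone           : strong model answers directly (re-solving)
#   strong_with_null_draft : strong model revises a semantically null draft
#                            (isolates structural scaffolding)
#   strong_with_draft      : strong model revises the weak model's real draft
#                            (adds the draft's actual content)

def decompose_second_pass_gain(weak_alone, strong_alone,
                               strong_with_null_draft, strong_with_draft):
    """Split the total gain over the weak model's solo accuracy into
    three additive components: re-solving, scaffold, and content."""
    re_solving = strong_alone - weak_alone
    scaffold = strong_with_null_draft - strong_alone
    content = strong_with_draft - strong_with_null_draft
    total = strong_with_draft - weak_alone
    # Sanity check: the three components sum to the total gain by construction.
    assert abs((re_solving + scaffold + content) - total) < 1e-9
    return {"re_solving": re_solving, "scaffold": scaffold, "content": content}

# Hypothetical accuracies for a code-generation task in which draft
# structure helps but weak draft content hurts, the pattern the paper reports:
parts = decompose_second_pass_gain(
    weak_alone=0.40, strong_alone=0.62,
    strong_with_null_draft=0.70, strong_with_draft=0.66)
```

Under these illustrative numbers, the scaffold term is positive while the content term is negative, matching the abstract's claim that even semantically null drafts can help on code generation while weak draft content can be harmful.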
Problem

Research questions and friction points this paper is trying to address.

multi-LLM revision
error correction
task structure
draft quality
pipeline design
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-LLM revision
re-solving
scaffolding
controlled decomposition
pipeline design
Jingjie Ning
School of Computer Science, Carnegie Mellon University
Xueqi Li
Shenzhen University
Chengyu Yu
School of Computer Science, Carnegie Mellon University