🤖 AI Summary
In multi-stage data science tasks, the unverifiability of terminal rewards leads to error accumulation during test-time scaling (TTS). To address this, we propose Selective TTS, a novel framework that distributes reasoning computation across pipeline stages and prunes low-quality branches early to mitigate judge drift and stabilize multi-step inference. Methodologically, we design an end-to-end multi-agent collaborative pipeline integrating stage-specific interpretable adjudicator models and a Kendall's τ-driven human expert calibration mechanism. Experiments demonstrate that, under a fixed computational budget, insight quality improves from an average score of 61.64 to 65.86, with significantly reduced variance. Moreover, adjudicator outputs achieve strong agreement (τ = 0.55) with human experts, validating the framework's reliability and generalizability across diverse reasoning tasks.
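The calibration metric above, Kendall's τ, measures rank agreement between the adjudicator's scores and human expert scores over the same set of outputs. As a rough illustration (the paper does not specify which τ variant is used; this sketch computes the simple tau-a, with no tie correction):

```python
from itertools import combinations

def kendall_tau(judge_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    A pair (i, j) is concordant when both score lists rank item i and
    item j in the same order, discordant when they disagree.
    Returns +1.0 for identical rankings, -1.0 for reversed rankings.
    """
    n = len(judge_scores)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (judge_scores[i] - judge_scores[j]) * (human_scores[i] - human_scores[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A reported τ = 0.55 thus means the judge orders noticeably more output pairs the same way as the human experts than differently.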
📝 Abstract
Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS): iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to error accumulation over stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than through repeated refinement over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline that generates visually insightful charts and reports for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall's τ=0.55). Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
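The core loop described in the abstract, branching at each pipeline stage and pruning low-quality branches with a stage-specific judge before moving on, can be sketched as follows. All names here (`selective_tts`, the stand-in `judge`, the branch and keep counts) are hypothetical placeholders, not the paper's actual implementation; real agent calls and judge models would replace the stubs:

```python
import random

def selective_tts(stages, branch_factor=4, keep=2, seed=0):
    """Sketch of stage-wise test-time scaling with early pruning.

    At each stage, expand every surviving branch into `branch_factor`
    candidates, score them with a (stubbed) stage-specific judge, and
    keep only the top `keep` before the next stage, so compute is
    spread across stages instead of spent re-refining one output.
    """
    rng = random.Random(seed)

    def generate(branch, stage, k):
        # Stand-in for an LLM agent producing candidate k for this stage.
        return f"{branch}->{stage}#{k}"

    def judge(candidate, stage):
        # Stand-in for a process-specific judge; here just a random score.
        return rng.random()

    branches = [""]
    for stage in stages:
        candidates = [generate(b, stage, k)
                      for b in branches
                      for k in range(branch_factor)]
        # Prune low-quality branches early, before errors accumulate.
        candidates.sort(key=lambda c: judge(c, stage), reverse=True)
        branches = candidates[:keep]
    return branches
```

With `keep=2`, only two branches survive each stage, so a weak early choice never consumes downstream budget, which is the intuition behind the reduced variance the authors report.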