Scaling Unverifiable Rewards: A Case Study on Visual Insights

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-stage data science tasks, the unverifiability of terminal rewards leads to error accumulation during test-time scaling (TTS). To address this, we propose Selective TTS, a framework that distributes reasoning computation across pipeline stages and prunes low-quality branches early, mitigating judge drift and stabilizing multi-step inference. Methodologically, we design an end-to-end multi-agent collaborative pipeline that integrates stage-specific interpretable adjudicator models and a Kendall’s τ-driven calibration mechanism against human experts. Experiments demonstrate that, under a fixed computational budget, insight quality improves from an average score of 61.64 to 65.86, with significantly reduced variance. Moreover, adjudicator outputs achieve strong agreement (τ = 0.55) with human experts, supporting the framework’s reliability and its generalizability to other reasoning tasks.
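The stage-wise "distribute compute, then prune" idea in the summary can be sketched as a small beam-style search. Everything here is an illustrative assumption, not the paper's implementation: the stage names, the random stand-in judge, and the beam width `keep` are all hypothetical.

```python
import random

def run_stage(candidate, stage):
    """Placeholder for one pipeline stage expanding a candidate
    (e.g. data profiling -> chart spec -> insight report)."""
    return [f"{candidate}->{stage}v{i}" for i in range(3)]

def judge(candidate, stage):
    """Placeholder for a process-specific judge scoring a partial
    result at its own stage; a random stand-in for illustration."""
    return random.random()

def selective_tts(seed, stages, keep=2):
    """Distribute inference compute across stages: expand each
    surviving branch, score expansions with that stage's judge,
    and prune low-quality branches early so they never consume
    later-stage compute."""
    branches = [seed]
    for stage in stages:
        expanded = [c for b in branches for c in run_stage(b, stage)]
        expanded.sort(key=lambda c: judge(c, stage), reverse=True)
        branches = expanded[:keep]  # early pruning step
    return branches[0]

best = selective_tts("dataset", ["profile", "chart", "insight"], keep=2)
print(best)
```

Pruning at every stage, rather than judging only the terminal output, is what keeps a weak judge's errors from compounding: each judge only has to rank partial results within its own stage.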

📝 Abstract
Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS): iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards, or sufficient data to train robust reward models, making judge-based refinement prone to accumulating error across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than through repeated refinement over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent system that generates visually insightful charts and a report for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall's τ = 0.55). Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
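The judge-alignment number reported above (Kendall's τ = 0.55) is a rank correlation between the LLM judge's scores and human expert scores over the same outputs. A minimal sketch of the metric itself, with made-up scores rather than the paper's data:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Assumes no tied scores, for simplicity of illustration."""
    assert len(xs) == len(ys)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # pair ranked the same way by both raters
        elif s < 0:
            discordant += 1   # pair ranked oppositely
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical judge vs. human scores for eight outputs:
judge = [62, 71, 58, 80, 66, 74, 55, 69]
human = [60, 75, 55, 78, 70, 72, 58, 65]
print(round(kendall_tau(judge, human), 3))  # → 0.786
```

τ ranges from -1 (reversed rankings) to +1 (identical rankings), so the paper's 0.55 indicates substantially better-than-chance agreement between judge and experts.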
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-stage tasks lacking verifiable rewards for refinement
Mitigates error accumulation in judge-based multi-agent pipelines
Improves quality of open-ended outputs like visual insights under fixed compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective TTS framework scales inference across multi-agent stages
Early pruning of low-quality branches using process-specific judges
LLM-based judge model aligned with human experts for reliability