Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of existing test-time aggregation methods, which typically rely on evaluations of individual reasoning trajectories in isolation or on answer frequencies and neglect pairwise comparisons among trajectories. The authors formulate aggregation as a constrained Ising-type energy minimization problem in which independent evaluation signals act as external fields and pairwise comparisons act as interactions. This framework subsumes existing voting and weighted aggregation methods as special cases, and it admits a theoretical interpretation under an answer-level homogeneity assumption. The interaction matrix is built from LLM-as-a-judge comparisons, and an efficient approximation strategy keeps interaction modeling practical at scale. Experiments on mathematical and code reasoning benchmarks show consistent improvements over existing baselines across tasks, judge models, trajectory budgets, and generation settings.

📝 Abstract

This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely either on evaluation signals collected from candidate traces in isolation or on answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons act as interactions. JC provides a unified framework for test-time aggregation that subsumes existing voting and weighted aggregation methods as special cases. Our construction of the interaction matrix leverages LLM-as-a-judge comparisons and admits a theoretical interpretation under answer-level homogeneity assumptions. Moreover, we develop an efficient approximation strategy that makes interaction modeling practical for large-scale test-time aggregation. Experiments on math and code reasoning benchmarks show that JC consistently outperforms existing baselines across tasks, judge models, trace budgets, and trace-generation settings.
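To make the "external fields plus interactions" picture concrete, here is a minimal Python sketch of an Ising-type aggregator. It is an illustration under stated assumptions, not the paper's formulation: the energy form E(S) = -Σ h_i - Σ_{i<j} J_ij over a selected set S, the constraint that S must be exactly the set of traces sharing one final answer, and the names `energy` and `aggregate` are all hypothetical choices made for this example. Exhaustive scoring of answer groups stands in for the paper's approximation strategy, which the abstract does not specify.

```python
from collections import defaultdict


def energy(selected, h, J):
    """Ising-type energy of a set of trace indices (lower is better).

    h[i]    : external field, e.g. an independent score for trace i (assumed).
    J[i][j] : pairwise interaction, e.g. a judge's comparison signal (assumed).
    """
    field = sum(h[i] for i in selected)
    pair = sum(J[i][j] for i in selected for j in selected if i < j)
    return -field - pair


def aggregate(answers, h, J):
    """Return the answer whose group of traces has the lowest energy.

    Assumed constraint (illustrative only): the selected traces are exactly
    those that share a single final answer.
    """
    groups = defaultdict(list)
    for i, answer in enumerate(answers):
        groups[answer].append(i)
    return min(groups, key=lambda a: energy(groups[a], h, J))


# Toy data: five traces, two distinct final answers.
answers = ["42", "42", "17", "17", "17"]
n = len(answers)
no_interactions = [[0.0] * n for _ in range(n)]

# h = 1, J = 0 reduces to majority voting.
majority = aggregate(answers, [1.0] * n, no_interactions)  # "17"

# h = per-trace weights, J = 0 reduces to weighted voting.
weighted = aggregate(answers, [0.9, 0.8, 0.2, 0.3, 0.4], no_interactions)  # "42"

# Nonzero J lets pairwise signals override raw frequency.
J = [[0.0] * n for _ in range(n)]
J[0][1] = 2.0   # traces 0 and 1 reinforce each other
J[2][3] = -1.0  # traces 2 and 3 conflict despite sharing an answer
joint = aggregate(answers, [1.0] * n, J)  # "42"
```

In this toy setup, setting the interactions to zero recovers majority and weighted voting, which mirrors the abstract's claim that those methods arise as special cases; the third call shows how pairwise terms can change the outcome when the fields alone would favor the more frequent answer.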

Problem

Research questions and friction points this paper is trying to address.

test-time aggregation

reasoning traces

answer aggregation

comparative interactions

evaluation signals

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Consistency

Test-Time Aggregation

Energy Minimization