ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

📅 2026-02-27
🤖 AI Summary
This work addresses the high computational cost, attribution ambiguity, and overthinking inherent in uniform brute-force sampling during large language model inference. To overcome these limitations, the authors propose ODAR-Expert, an adaptive routing framework that dynamically assesses query difficulty via amortized active inference and routes requests to either fast or slow reasoning agents accordingly. The framework integrates a risk-sensitive free energy minimization mechanism to fuse agent outputs, leveraging varentropy-based uncertainty quantification and heterogeneous agent policies. Evaluated across 23 benchmarks, ODAR-Expert achieves Pareto-optimal trade-offs between accuracy and efficiency—attaining 98.2% on MATH and 54.8% on HLE—while reducing computational costs by 82% under open-source stacks, substantially outperforming homogeneous sampling approaches.
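The routing step described above (estimate difficulty, then dispatch to a fast heuristic agent or a slow deliberative agent) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold, the toy length-based difficulty estimator, and the agent stubs are all hypothetical stand-ins for the amortized active-inference estimator and real LLM agents.

```python
def route(query, difficulty_estimator, fast_agent, slow_agent, threshold=0.6):
    """Dispatch the query to the slow (deliberative) agent only when the
    estimated difficulty crosses the threshold; otherwise use the fast
    (heuristic) agent. Threshold and names are illustrative."""
    difficulty = difficulty_estimator(query)  # expected to lie in [0, 1]
    agent = slow_agent if difficulty >= threshold else fast_agent
    return agent(query)

# Toy stand-ins: a real system would use an amortized active-inference
# difficulty estimator and actual fast/slow LLM reasoning agents.
toy_estimator = lambda q: min(len(q.split()) / 20.0, 1.0)
fast = lambda q: ("fast", q)
slow = lambda q: ("slow", q)
```

With these stubs, a short factual query stays on the fast path while a longer, harder-looking one is escalated, which is the adaptive-allocation behavior the summary describes.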

📝 Abstract
The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimator grounded in amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. We further introduce a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy) as a principled alternative to ad hoc voting over heterogeneous candidates. Extensive evaluation across 23 benchmarks shows strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. We also validate reproducibility on a fully open-source stack (Llama 4 + DeepSeek), where ODAR surpasses homogeneous sampling strategies while reducing computational costs by 82%. Overall, our results suggest that thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute.
Problem

Research questions and friction points this paper is trying to address.

LLM reasoning
test-time compute scaling
adaptive routing
resource allocation
overthinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive routing
active inference
free energy principle
epistemic uncertainty
compute-efficient LLM reasoning