🤖 AI Summary
This work addresses the inefficiency of large language models (LLMs), which often expend excessive computation over-reasoning on simple tasks and struggle to balance efficiency with accuracy. To tackle this, we propose Ada-RS, an adaptive rejection sampling framework that, for the first time, introduces rejection sampling into reasoning-path selection. Ada-RS evaluates multiple sampled paths with a length-penalized reward and stochastically rejects low-value paths, retaining only high-quality candidates for downstream preference optimization. This approach dynamically balances reasoning depth against computational cost, remains compatible with preference and policy optimization algorithms such as DPO and DAPO, and integrates seamlessly with LoRA-based fine-tuning. Experiments on Qwen3-8B demonstrate that Ada-RS reduces output token usage by up to 80% and cuts reasoning overhead by up to 95%, while maintaining or even improving tool-calling accuracy.
📝 Abstract
Large language models (LLMs) are increasingly deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample-filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference-pair methods (e.g., DPO) and grouped policy optimization strategies (e.g., DAPO). Using Qwen3-8B with LoRA on a synthetic tool-calling e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms, reducing average output tokens by up to 80% and thinking rate by up to 95% while maintaining or improving tool-call accuracy. These results highlight that training-signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.
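To make the core filtering step concrete, the sketch below illustrates one plausible reading of Ada-RS: each sampled completion is scored with a length-penalized reward, and lower-scoring paths are rejected stochastically relative to the best path. The penalty coefficient, the exponential acceptance rule, and all function names here are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def length_penalized_reward(base_reward, num_tokens, lam=0.001):
    # Hypothetical reward shaping: subtract a per-token cost so that
    # shorter reasoning paths with equal task reward score higher.
    return base_reward - lam * num_tokens

def ada_rs_filter(candidates, lam=0.001, temperature=0.5, seed=0):
    """Stochastically reject low-value sampled paths (illustrative rule).

    candidates: list of (base_reward, num_tokens, completion) tuples.
    Each candidate is accepted with probability exp((r - r_best) / T),
    so the best-scoring path is always kept and weaker paths survive
    with exponentially decaying probability.
    """
    rng = random.Random(seed)
    scored = [(length_penalized_reward(r, n, lam), completion)
              for r, n, completion in candidates]
    r_best = max(score for score, _ in scored)
    kept = []
    for score, completion in scored:
        p_accept = math.exp((score - r_best) / temperature)  # in (0, 1]
        if rng.random() < p_accept:
            kept.append((score, completion))
    return kept

# Example: a short correct path, a verbose correct path, and a wrong one.
candidates = [
    (1.0, 100, "short-correct"),
    (1.0, 1000, "long-correct"),
    (0.0, 50, "wrong"),
]
kept = ada_rs_filter(candidates)
```

Under this sketch, the short correct path always survives (its acceptance probability is 1), while the verbose and incorrect paths are kept only occasionally; the retained candidates would then feed preference-pair construction or grouped policy updates downstream.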