Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning

📅 2025-06-09
🤖 AI Summary
Large language model (LLM) inference often suffers from inefficiency due to redundant token generation: existing reinforcement learning (RL) approaches prioritize accuracy with little regard for output conciseness, while direct length penalties improve efficiency at the cost of noticeable accuracy drops. To address this trade-off, the paper proposes Bingo, a dynamic, significance-aware RL framework in which a token-level significance assessment drives an adaptive length reward that penalizes only insignificant tokens. This reward is coupled with a dynamic decay strategy that initially encourages elaborate reasoning on hard questions and then tapers off to improve overall efficiency. Extensive experiments across multiple reasoning benchmarks show that Bingo compresses output length without sacrificing accuracy, outperforming the vanilla reward and several length-based RL baselines and achieving a favorable accuracy-efficiency trade-off.

📝 Abstract
Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-based rewards to encourage brevity, but this often leads to noticeable drops in accuracy. In this paper, we propose Bingo, an RL framework that advances length-based reward design to boost efficient reasoning. Bingo incorporates two key mechanisms: a significance-aware length reward, which gradually guides the model to reduce only insignificant tokens, and a dynamic length reward, which initially encourages elaborate reasoning for hard questions but decays over time to improve overall efficiency. Experiments across multiple reasoning benchmarks show that Bingo improves both accuracy and efficiency. It outperforms the vanilla reward and several other length-based reward baselines in RL, achieving a favorable trade-off between accuracy and efficiency. These results underscore the potential of training LLMs explicitly for efficient reasoning.
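The significance-aware length reward described above can be sketched as follows. This is a hypothetical illustration only: the function names, the per-token scoring scheme, and the penalty coefficient are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a significance-aware length reward.
# `token_scores` is assumed to be a per-token significance score in [0, 1];
# how Bingo actually computes significance is not reproduced here.

def significance_length_reward(token_scores, threshold=0.5, penalty=0.01):
    """Penalize only tokens whose significance score falls below `threshold`,
    so pruning pressure lands on insignificant tokens rather than all length."""
    insignificant = sum(1 for s in token_scores if s < threshold)
    return -penalty * insignificant

def total_reward(correct, token_scores):
    """Combine a binary accuracy reward with the significance-aware penalty."""
    return (1.0 if correct else 0.0) + significance_length_reward(token_scores)
```

The key design point is that a fully significant output incurs no length penalty at all, so the model is not pushed toward brevity when every token carries reasoning content.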
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning efficiency in large language models
Reducing verbose outputs without sacrificing accuracy
Balancing accuracy and efficiency via dynamic RL rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Significance-aware length reward reduces insignificant tokens
Dynamic length reward adapts to question difficulty
RL framework balances accuracy and efficiency
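The dynamic length reward's decay could take many forms; one minimal sketch, assuming a warmup phase followed by linear decay (the warmup fraction and linear shape are assumptions, not the paper's schedule):

```python
# Hypothetical decay schedule for the dynamic length reward: the weight on
# the length bonus stays at 1.0 during warmup (encouraging elaborate
# reasoning on hard questions early), then decays linearly to zero so that
# efficiency dominates later training.

def length_bonus_weight(step, total_steps, warmup_frac=0.2):
    warmup_steps = int(warmup_frac * total_steps)
    if step <= warmup_steps:
        return 1.0
    remaining = total_steps - warmup_steps
    return max(0.0, 1.0 - (step - warmup_steps) / remaining)
```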