TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the issue of hallucination in large language models when confronted with uncertain queries, stemming from their inability to effectively abstain from answering. To mitigate this, we propose a trajectory-aware abstention learning method that operates within the Group Relative Policy Optimization (GRPO) framework. By leveraging multiple response trajectories as intrinsic signals for abstention, our approach dynamically reweights abstention rewards to better explore the model’s knowledge boundaries and enhance the consistency of its refusal behavior. We introduce a novel evaluation paradigm tailored for abstention learning and validate our method on the AbstentionBench benchmark, achieving state-of-the-art abstention F1 scores in five out of six categories and outperforming a static ternary reward baseline on 17 of 31 datasets, all while fully preserving original task accuracy.
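For readers unfamiliar with the metric, the sketch below shows one common way an abstention F1 score can be computed: abstention is treated as the positive class, and F1 is the harmonic mean of abstention precision and recall. The function name `abstention_f1` and the boolean-list inputs are assumptions made for illustration; the summary does not specify how AbstentionBench or the paper computes the score.

```python
from typing import Sequence


def abstention_f1(should_abstain: Sequence[bool], did_abstain: Sequence[bool]) -> float:
    """Illustrative F1 with 'abstain' as the positive class (hypothetical helper)."""
    pairs = list(zip(should_abstain, did_abstain))
    # True positive: the model abstained on a query where abstention was warranted.
    tp = sum(1 for s, d in pairs if s and d)
    # False positive: the model abstained although the query was answerable.
    fp = sum(1 for s, d in pairs if (not s) and d)
    # False negative: the model answered although it should have abstained.
    fn = sum(1 for s, d in pairs if s and (not d))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a model that never abstains scores 0, and so does a model that abstains only on answerable queries.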

📝 Abstract

This paper investigates abstention learning for large language models (LLMs), building on prior work that uses a ternary reward to incentivize truthfulness. It extends that idea from a static ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically reweights the abstention reward during Group Relative Policy Optimization (GRPO) training. The objective is abstention learning rather than improved truthfulness, and the work serves as an exploration of hallucination reduction. Its novelty lies in methodological innovation, advantage reweighting, and benchmark selection. The method treats GRPO's multiple sampled trajectories as a natural abstention signal, using them to explore the policy's knowledge boundaries and to encourage consistent behavior. The paper first demonstrates that the trajectories indicate how confident the policy is on a given query, and then uses them to compute the abstention advantage dynamically. Because the work aims to contribute to the field of abstention learning, AbstentionBench serves as the evaluation benchmark; the method and various baselines were tested on all of its datasets. Empirical results show that TIAR achieves state-of-the-art abstention F1 scores in five of six evaluation categories and outperforms the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.
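The general idea can be illustrated with a minimal sketch. The abstract does not give the TIAR formula, so everything specific here is an assumption: the ternary reward values (+1 correct, 0 abstain, -1 wrong), the confidence proxy (share of answered trajectories in the group that are correct), and the rule that maps confidence to an abstention advantage. It is meant only to show how a group of GRPO trajectories could inform the advantage assigned to abstaining, not to reproduce the paper's method.

```python
from statistics import mean, pstdev
from typing import List

CORRECT, ABSTAIN, WRONG = "correct", "abstain", "wrong"

# Assumed ternary reward values; the abstract does not state them.
TERNARY_REWARD = {CORRECT: 1.0, ABSTAIN: 0.0, WRONG: -1.0}


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style normalization: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trajectory_informed_advantages(outcomes: List[str]) -> List[float]:
    """Hypothetical reweighting: abstention advantage depends on group confidence."""
    rewards = [TERNARY_REWARD[o] for o in outcomes]
    advantages = group_relative_advantages(rewards)

    answered = [o for o in outcomes if o != ABSTAIN]
    if not answered:
        # Every trajectory abstained: no confidence signal, so no abstention push.
        abstain_advantage = 0.0
    else:
        # Hypothetical confidence proxy: fraction of answered trajectories that are correct.
        confidence = sum(o == CORRECT for o in answered) / len(answered)
        # Hypothetical rule: +1 when the group is always wrong, -1 when always correct.
        abstain_advantage = 1.0 - 2.0 * confidence

    # Answered trajectories keep their group-relative advantage;
    # abstaining trajectories receive the confidence-dependent value.
    return [abstain_advantage if o == ABSTAIN else a
            for a, o in zip(advantages, outcomes)]
```

With this rule, abstaining is encouraged on queries where the sampled answers are mostly wrong and discouraged where they are mostly correct, which is one way to read "trajectories as a confidence indicator". A static ternary baseline would instead use the output of `group_relative_advantages` unchanged.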

Problem

Research questions and friction points this paper is trying to address.

abstention learning

hallucination reduction

large language models

trajectory-informed reward

confidence calibration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-Informed Advantage Reweighting

Abstention Learning

Group Relative Policy Optimization