🤖 AI Summary
This work addresses the need for verifiable ranking outputs in decision-support systems by introducing the Evidence-Certified Candidate Ranking (ECCR) task, in which a system outputs a Top-K ranking together with evidence certificates whose cited text spans must be sufficient to recover the decision. To this end, the authors propose Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective that treats the ranking and its evidence certificate as one joint action. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features, then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward. The evidence-cycle reward is computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans, and the optimization target shifts from ordinary NDCG alone to CertNDCG and decision-evidence coupling. Experiments on the MAVEN-ERE and RAMS datasets demonstrate that the proposed approach significantly outperforms zero-shot, supervised fine-tuning (SFT), and GRPO baselines, achieving state-of-the-art CertNDCG performance across diverse candidate configurations.
📝 Abstract
Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.
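The abstract names CertNDCG but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: a ranked candidate contributes gain to NDCG only when its evidence certificate has been accepted by a verifier. The `RankedItem` type, the `certified` flag, the `"doc_id:start-end"` span format, and the function `cert_gated_ndcg_at_k` are all hypothetical names introduced here, not the paper's definitions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RankedItem:
    candidate_id: str      # window-local candidate identifier
    relevance: float       # graded gold relevance (hypothetical labels)
    evidence: List[str]    # cited spans, assumed "doc_id:start-end" format
    certified: bool        # assumed: True if a verifier accepted the evidence


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(items: List[RankedItem], all_relevances: List[float], k: int) -> float:
    """Ordinary NDCG@k: evidence is ignored."""
    ideal = dcg(sorted(all_relevances, reverse=True)[:k])
    if ideal == 0:
        return 0.0
    return dcg([it.relevance for it in items[:k]]) / ideal


def cert_gated_ndcg_at_k(items: List[RankedItem], all_relevances: List[float], k: int) -> float:
    """Assumed CertNDCG-style variant: uncertified items earn zero gain."""
    ideal = dcg(sorted(all_relevances, reverse=True)[:k])
    if ideal == 0:
        return 0.0
    gains = [it.relevance if it.certified else 0.0 for it in items[:k]]
    return dcg(gains) / ideal


# Toy ranking: the order is ideal, but the second item lacks valid evidence.
ranking = [
    RankedItem("c1", 3.0, ["d7:120-164"], certified=True),
    RankedItem("c2", 2.0, [], certified=False),
    RankedItem("c3", 1.0, ["d2:10-42"], certified=True),
]
gold = [3.0, 2.0, 1.0]

plain = ndcg_at_k(ranking, gold, k=3)
gated = cert_gated_ndcg_at_k(ranking, gold, k=3)
print(f"NDCG@3={plain:.3f}  cert-gated NDCG@3={gated:.3f}")
```

Under this assumed definition, a perfectly ordered list still scores below 1 whenever a relevant candidate is cited without acceptable evidence, which is the kind of decision-evidence coupling the abstract describes.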