Interactive Critique-Revision Training for Reliable Structured LLM Generation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses structured decision-making tasks such as form filling, where outputs must be locally correct, globally consistent, and auditable, and where existing refinement approaches do not provide reliable guarantees. The authors frame the task as a two-player generator–verifier game: the verifier either stays silent or raises a Safety Assurance Case (SAC) stating a claim, argument, and evidence, and the generator then decides whether to keep or revise its output. These decisions form paired counterfactual action groups, which drive policy optimization. The resulting method, Dual Paired-Action Group-Relative Policy Optimization (DPA-GRPO), is presented as the first algorithm to combine structured verifier interventions with paired-action mechanisms. It performs role-specific KL-regularized policy updates and comes with a theoretical convergence analysis of the game dynamics. Experiments on TaxCalcBench TY24 with Qwen3-4B and Qwen3-8B show higher structured decision accuracy than zero-shot and generator-only RL baselines, along with more correct silent acceptance, fewer missed errors, and better-calibrated revision.

📝 Abstract

In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator–verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze the unregularized game and show that positive probability on strictly lower-reward intervention or revision actions creates a profitable unilateral deviation. Under standard stochastic-approximation assumptions, DPA-GRPO tracks the corresponding game ODE, whose isolated asymptotically stable limit points are stationary and candidate local equilibria under role-wise local optimality. Experiments on TaxCalcBench TY24 show that DPA-GRPO improves structured decision accuracy over zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B. Training increases correct silent acceptance, reduces missed errors, and improves calibrated revision behavior, indicating gains for both generator and verifier.
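
The abstract does not spell out the reward, the normalization, or the KL term, so the following is only a minimal sketch of how a paired counterfactual action group might be scored. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation) and omits KL regularization entirely. The function names, action labels, and reward values are hypothetical and chosen for illustration. The second function illustrates the deviation argument for a two-action group: if a policy puts probability p on the strictly lower-reward action, moving that mass to the better action raises expected reward by p times the reward gap.

```python
from statistics import mean, pstdev


def paired_group_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one paired action group.

    `rewards` maps each counterfactual action (e.g. "SAC" vs "NO_SAC")
    to its scalar reward. Assumed normalization: (r - mean) / (std + eps).
    """
    values = list(rewards.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std over the small group
    return {action: (r - mu) / (sigma + eps) for action, r in rewards.items()}


def deviation_gain(p_low, r_high, r_low):
    """Expected-reward gain from moving all mass off the lower-reward action."""
    return p_low * (r_high - r_low)


# Hypothetical rewards at one decision point for each role.
verifier_group = {"SAC": 1.0, "NO_SAC": 0.0}    # raising a case was correct
generator_group = {"KEEP": 0.0, "REVISE": 1.0}  # revising fixed the output

verifier_adv = paired_group_advantages(verifier_group)
generator_adv = paired_group_advantages(generator_group)
```

In this sketch each role gets its own advantage dictionary, which mirrors the role-specific updates described in the abstract. A full implementation would weight policy log-probabilities by these advantages and add a per-role KL penalty against a reference policy.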

Problem

Research questions and friction points this paper is trying to address.

structured generation

reliability

auditable AI

LLM alignment

decision-making workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

DPA-GRPO

structured LLM generation

generator-verifier game