🤖 AI Summary
This work addresses the challenge of aligning large language models for high-stakes domains like insurance, where strict regulatory compliance, minimal hallucination, and strong general capabilities must coexist, a balance rarely achieved by existing approaches. The authors propose an end-to-end alignment paradigm that integrates verifiable data synthesis, dynamic data annealing, and a progressive SFT-RL curriculum combining RLVR and RLAIF to train INS-S1, a domain-specialized insurance model. Evaluated on INSEva, the most comprehensive insurance benchmark to date, INS-S1 achieves state-of-the-art performance, surpassing strong general-purpose models such as DeepSeek-R1 and Gemini-2.5-Pro while maintaining top-tier general abilities and reducing the hallucination rate to just 0.6%.
📝 Abstract
Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: these scenarios demand strict adherence to complex regulations and business logic, with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off, sacrificing general intelligence for domain expertise, or rely heavily on retrieval-augmented generation (RAG) without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) a Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) a Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves state-of-the-art (SOTA) performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.