🤖 AI Summary
To address weak multi-step reasoning and low legal citation accuracy in Thai legal question answering, this paper proposes a method that integrates Retrieval-Augmented Generation (RAG) with Group Relative Policy Optimization (GRPO). Methodologically, it employs the lightweight BGE-M3 embedding model as a semantic-similarity reward signal in place of computationally expensive LLM-based judges, reducing reward-computation cost by 2.5×. GRPO training builds on instruction tuning, jointly optimizing legal semantic matching and policy behavior. Evaluated on the NitiBench benchmark, the approach achieves up to a 90% improvement in legal citation F1 over the base model and a 31% gain in the joint quality metric over instruction tuning alone. These results demonstrate substantially improved robustness and practicality for multi-step reasoning and precise legal citation in Thai legal QA.
📝 Abstract
The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering remains limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach that aligns LLMs toward improved law citation accuracy and better response quality using Group Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, reducing computational expenses by up to 2.5× compared to large-language-model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains over the base model and a 31% increase in the joint quality metric over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
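The reward design described above can be sketched roughly as follows. The paper names BGE-M3 as the embedding model; the `embed` function below is only a toy character-trigram stand-in so the sketch runs without model weights, and `grpo_advantages` illustrates the group-relative reward normalization that gives GRPO its name. All function names here are illustrative assumptions, not taken from the paper.

```python
import math

def embed(text: str) -> dict:
    # Toy stand-in for BGE-M3: a bag of character trigrams.
    # In the actual method, this would be a dense BGE-M3 embedding.
    vec = {}
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(u: dict, v: dict) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_reward(response: str, reference: str) -> float:
    # Embedding-similarity reward, replacing an expensive LLM judge.
    return cosine(embed(response), embed(reference))

def grpo_advantages(rewards: list) -> list:
    # GRPO core: each sampled response's reward is normalized against
    # the mean and std of its own sampling group (no value network).
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]
```

The cheapness of the reward is the point: per rollout, scoring costs one embedding forward pass and a dot product rather than an LLM-judge generation, which is consistent with the reported 2.5× cost reduction.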