🤖 AI Summary
This study addresses the challenge of enabling large language models to reliably pass Japan's highly specialized and format-strict bar examination without altering question types or scoring criteria. To this end, the authors construct a novel dataset that faithfully replicates the authentic structure and grading standards of the exam and propose a single-model architecture incorporating format-faithful supervision and a self-verification mechanism. This approach achieves high-precision legal reasoning through joint judgment across multiple propositions. Notably, it is the first method to surpass the official passing threshold while preserving the original examination setting, substantially outperforming existing strategies such as multi-agent systems and task decomposition. The results underscore the critical role of structural consistency and carefully designed supervision in complex, domain-specific reasoning tasks.
📄 Abstract
Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true/false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model exceeds the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and code are publicly available.
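To make the gap between decomposed and exam-faithful evaluation concrete, the following is a minimal sketch (not the paper's code; all names and the toy data are hypothetical) contrasting per-proposition accuracy with a joint, exam-style scoring scheme in which a question earns credit only if every one of its propositions is judged correctly:

```python
# Hypothetical illustration: joint scoring over multiple propositions is
# stricter than per-proposition (decomposed) accuracy.
from typing import List

def per_proposition_accuracy(preds: List[List[bool]], gold: List[List[bool]]) -> float:
    """Fraction of individual propositions judged correctly (decomposed view)."""
    correct = sum(p == g for ps, gs in zip(preds, gold) for p, g in zip(ps, gs))
    total = sum(len(gs) for gs in gold)
    return correct / total

def joint_question_accuracy(preds: List[List[bool]], gold: List[List[bool]]) -> float:
    """Fraction of questions with ALL propositions correct (exam-style view)."""
    return sum(ps == gs for ps, gs in zip(preds, gold)) / len(gold)

# Toy data: 3 questions x 4 propositions, with one wrong proposition in two questions.
gold  = [[True, False, True, True], [False, False, True, True], [True, True, False, False]]
preds = [[True, False, True, False], [False, False, True, True], [True, False, False, False]]

print(per_proposition_accuracy(preds, gold))  # 10/12 ≈ 0.83
print(joint_question_accuracy(preds, gold))   # 1/3  ≈ 0.33
```

High decomposed accuracy (here 83%) can coexist with a far lower exam-scale score (33%), which is why evaluating under the original joint format, as this work does, matters.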