🤖 AI Summary
This work addresses the absence of a localized evaluation framework for large language models (LLMs) in the Indian judicial context. We introduce the first benchmark grounded in authentic Indian legal examinations, pairing multiple-choice questions from top national and state-level exams with extended-response tasks from the Supreme Court Advocate-on-Record Examination. Methodologically, we employ multi-round standardized testing coupled with paired-blind, lawyer-graded human evaluation, establishing real legal exams as the primary metric for assessing LLMs' "judicial readiness." Our contributions are threefold: (1) the first India-specific legal reasoning evaluation framework; (2) empirical identification of systematic deficiencies in procedural compliance, citation norms, and courtroom-appropriate expression; and (3) characterization of three distinct failure modes that delineate the boundary between AI-assisted and human-led legal reasoning. Experimental results show that state-of-the-art models match or exceed top human examinees on objective items, but none surpasses the highest-scoring human candidate on long-form legal reasoning tasks.
📝 Abstract
Large language models (LLMs) are entering legal workflows, yet we lack a jurisdiction-specific framework for assessing their baseline competence in those workflows. We use India's public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond multiple-choice questions, we also include a lawyer-graded, paired-blind study of long-form answers from the Supreme Court's Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural or format compliance, authority or citation discipline, and forum-appropriate voice and structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, and statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.
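For the objective exams, the evaluation described above ultimately reduces to scoring a model's answer sheet against the published answer key and comparing the total with the historical qualifying cutoff for that exam year. The sketch below illustrates that scoring step only; the class and field names, marking-scheme values, and cutoff are illustrative assumptions, not the benchmark's actual protocol or data.

```python
# Minimal sketch (hypothetical names and toy values) of scoring one model's
# objective-exam answers against an answer key and a historical cutoff.
from dataclasses import dataclass

@dataclass
class ExamPaper:
    answer_key: dict[str, str]   # question_id -> correct option, e.g. "Q17" -> "C"
    marks_per_correct: float     # marks awarded for each correct option
    negative_per_wrong: float    # marks deducted per wrong option (0 if no negative marking)
    historical_cutoff: float     # qualifying score from the corresponding exam year

def score_run(paper: ExamPaper, model_answers: dict[str, str]) -> dict:
    """Score one run; unattempted questions neither earn nor lose marks."""
    correct = wrong = skipped = 0
    for qid, key in paper.answer_key.items():
        choice = model_answers.get(qid)
        if choice is None:
            skipped += 1
        elif choice == key:
            correct += 1
        else:
            wrong += 1
    total = correct * paper.marks_per_correct - wrong * paper.negative_per_wrong
    return {
        "score": total,
        "correct": correct,
        "wrong": wrong,
        "skipped": skipped,
        "clears_cutoff": total >= paper.historical_cutoff,
    }

# Example usage with toy values (not real exam data):
paper = ExamPaper(
    answer_key={"Q1": "B", "Q2": "D", "Q3": "A"},
    marks_per_correct=1.0,
    negative_per_wrong=0.25,
    historical_cutoff=2.0,
)
print(score_run(paper, {"Q1": "B", "Q2": "C"}))
# -> {'score': 0.75, 'correct': 1, 'wrong': 1, 'skipped': 1, 'clears_cutoff': False}
```

The negative-marking parameter is included because many Indian objective exams penalize wrong answers; setting it to zero covers exams that do not. The lawyer-graded, paired-blind long-form evaluation is not shown here, as it relies on human graders rather than an automated key.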