🤖 AI Summary
Problem: Existing large language models (LLMs) exhibit critical deficiencies in Korean financial contexts, including insufficient domain-specific knowledge, weak legal reasoning capabilities, and poor detection of financial toxicity, and the field lacks a systematic, domain-specific evaluation benchmark. Method: We introduce KFinEval-Pilot, the first multidimensional evaluation benchmark for Korean financial AI, covering three high-stakes tasks: financial knowledge QA, legal clause reasoning, and financial toxicity identification. We propose a hybrid data construction paradigm combining GPT-4-assisted semi-automatic generation with rigorous domain-expert validation. Contribution/Results: Our empirical evaluation across 10+ LLMs reveals a previously undocumented family-level trade-off between accuracy and safety in financial language understanding. KFinEval-Pilot enables reproducible, interpretable early-stage assessment, effectively diagnosing model weaknesses and advancing trustworthy Korean financial AI development.
📝 Abstract
We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.