FinReflectKG - EvalBench: Benchmarking Financial KG with Multi-Dimensional Evaluation

📅 2025-10-07
🤖 AI Summary
The absence of a standardized evaluation benchmark hinders rigorous assessment of financial knowledge graph (KG) construction. Method: This paper introduces the first structured knowledge extraction benchmark specifically for SEC 10-K filings, featuring a multi-dimensional evaluation framework (faithfulness, precision, relevance, and comprehensiveness) and a deterministic commit-then-justify judging protocol with explicit bias controls. Scoring is hybrid, combining binary triple-level judgments with an ordinal chunk-level comprehensiveness grade. Contribution/Results: The authors empirically validate large language models as reliable, interpretable, and cost-effective evaluators. A systematic comparison of extraction paradigms (single-pass, multi-pass, and reflective) shows that reflection-based extraction achieves the best overall performance, while single-pass extraction excels in faithfulness. This work advances standardization, transparency, and governance in financial AI evaluation.

📝 Abstract
Large language models (LLMs) are increasingly being used to extract structured knowledge from unstructured financial text. Although prior studies have explored various extraction methods, there is no universal benchmark or unified evaluation framework for the construction of financial knowledge graphs (KG). We introduce FinReflectKG - EvalBench, a benchmark and evaluation framework for KG extraction from SEC 10-K filings. Building on the agentic and holistic evaluation principles of FinReflectKG - a financial KG linking audited triples to source chunks from S&P 100 filings and supporting single-pass, multi-pass, and reflection-agent-based extraction modes - EvalBench implements a deterministic commit-then-justify judging protocol with explicit bias controls, mitigating position effects, leniency, verbosity and world-knowledge reliance. Each candidate triple is evaluated with binary judgments of faithfulness, precision, and relevance, while comprehensiveness is assessed on a three-level ordinal scale (good, partial, bad) at the chunk level. Our findings suggest that, when equipped with explicit bias controls, LLM-as-Judge protocols provide a reliable and cost-efficient alternative to human annotation, while also enabling structured error analysis. Reflection-based extraction emerges as the superior approach, achieving best performance in comprehensiveness, precision, and relevance, while single-pass extraction maintains the highest faithfulness. By aggregating these complementary dimensions, FinReflectKG - EvalBench enables fine-grained benchmarking and bias-aware evaluation, advancing transparency and governance in financial AI applications.
Problem

Research questions and friction points this paper is trying to address.

Establishing a benchmark for financial knowledge graph extraction from SEC filings
Evaluating extraction methods with bias-controlled, multi-dimensional metrics
Comparing single-pass, multi-pass, and reflection-agent-based KG construction approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework implements a deterministic commit-then-justify judging protocol
Triples evaluated with binary judgments of faithfulness, precision, and relevance
Reflection-based extraction emerges as the superior overall approach
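The hybrid scoring scheme described above can be sketched in a few lines: each extracted triple receives binary judgments on three dimensions, while comprehensiveness is graded once per chunk on a three-level ordinal scale. A minimal illustrative sketch follows; the class names, field names, and the numeric mapping of the ordinal grades are assumptions for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical mapping of the ordinal chunk-level grades to scores
# (the paper only specifies the labels good / partial / bad).
COMPREHENSIVENESS = {"good": 1.0, "partial": 0.5, "bad": 0.0}

@dataclass
class TripleJudgment:
    """Binary LLM-as-Judge verdicts for one extracted triple."""
    faithful: bool
    precise: bool
    relevant: bool

def score_chunk(triples: list[TripleJudgment], comprehensiveness: str) -> dict:
    """Aggregate per-triple binary judgments with the chunk-level ordinal grade."""
    return {
        "faithfulness": mean(t.faithful for t in triples),
        "precision": mean(t.precise for t in triples),
        "relevance": mean(t.relevant for t in triples),
        "comprehensiveness": COMPREHENSIVENESS[comprehensiveness],
    }

# Example: two judged triples from one source chunk graded "partial".
scores = score_chunk(
    [TripleJudgment(True, True, True), TripleJudgment(True, False, True)],
    "partial",
)
print(scores)
```

Aggregating the binary dimensions as per-chunk rates while keeping comprehensiveness as a separate ordinal score mirrors the paper's point that the dimensions are complementary and should be reported side by side rather than collapsed into a single number.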
Fabrizio Dimino
Domyn, New York, US
Abhinav Arun
Domyn, New York, US
Bhaskarjit Sarmah
Domyn
Machine Learning, Generative AI, Agentic AI, Responsible AI
Stefano Pasquali
Domyn, New York, US