🤖 AI Summary
This work addresses the high cost of manual annotation and the hallucination risk of fully automated pipelines in structured information extraction from large-scale historical documents. The authors propose Double Triangle Annotation, a two-layer human-in-the-loop framework that assumes only error independence between models and requires neither distributional priors nor task-specific calibration. In the first layer, two architecturally distinct multimodal large language models annotate each document in parallel; labels on which they agree are accepted automatically, while disagreements are routed to a human jury. A second layer cross-checks two such systems against each other and escalates only the residual conflicts to a domain expert. Evaluated on the Guides Rosenwald, a corpus of French medical directories spanning 1887–1906, the method achieves a final word error rate of 0.003, with over 85% of 13,595 fields accepted automatically via model consensus. The resulting benchmark is the first structured-extraction ground truth for the Guides Rosenwald, and the framework needs less human intervention as model capabilities improve.
📝 Abstract
Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and when they disagree, the field is routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption, error independence between models; it requires no distributional priors or task-specific calibration and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887–1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark, the first structured-extraction ground truth for the Guides Rosenwald, to support future work on historical document processing.
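The two-layer routing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how agreement is tested or how the jury and expert are consulted, so the whitespace-and-case normalization, the per-field dictionaries, and the names `first_layer`, `second_layer`, `ask_human`, and `ask_expert` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical callback type: receives (field name, value A, value B)
# and returns the value chosen by a human.
Resolver = Callable[[str, str, str], str]


def normalize(value: str) -> str:
    """Assumed agreement test: ignore case and whitespace differences."""
    return " ".join(value.split()).casefold()


@dataclass
class FieldDecision:
    field: str
    value: str
    source: str  # "consensus" or "human"


def first_layer(labels_a: dict[str, str], labels_b: dict[str, str],
                ask_human: Resolver) -> list[FieldDecision]:
    """Auto-accept fields on which two models agree; route the rest to a jury."""
    decisions = []
    for field in sorted(set(labels_a) | set(labels_b)):
        a = labels_a.get(field, "")
        b = labels_b.get(field, "")
        if normalize(a) == normalize(b):
            decisions.append(FieldDecision(field, a, "consensus"))
        else:
            decisions.append(FieldDecision(field, ask_human(field, a, b), "human"))
    return decisions


def second_layer(system_1: dict[str, str], system_2: dict[str, str],
                 ask_expert: Resolver) -> dict[str, str]:
    """Cross-check two first-layer outputs; escalate residual conflicts."""
    final = {}
    for field in sorted(set(system_1) | set(system_2)):
        v1 = system_1.get(field, "")
        v2 = system_2.get(field, "")
        final[field] = v1 if normalize(v1) == normalize(v2) else ask_expert(field, v1, v2)
    return final


def auto_accept_rate(decisions: list[FieldDecision]) -> float:
    """Share of fields accepted by model consensus without human input."""
    if not decisions:
        return 0.0
    return sum(d.source == "consensus" for d in decisions) / len(decisions)
```

In this sketch, `auto_accept_rate` corresponds to the kind of statistic reported in the abstract (over 85% of fields auto-accepted); the stricter or looser the assumed `normalize` function, the lower or higher that rate would be.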