Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning

📅 2025-05-30

🤖 AI Summary

Problem: Quantifying the long-term impact of 1930s U.S. housing policies, particularly redlining, on racial wealth inequality is hindered by the scarcity of digitized, large-scale historical property appraisal records, most of which exist only as unstructured paper archives. Method: We propose a scalable digital framework for automated extraction and estimation of historical property valuations from scanned archival documents. Our approach adopts a two-stage OCR-regression paradigm: (1) structured attribute recognition via CRNN or Transformer-based OCR; (2) county-transferable supervised regression modeling using building-level features. Contributions/Results: We release the first manually annotated, county-level historical property card dataset (12,000+ manually annotated cards, plus OCR-derived labels for an additional 50,000 properties); present the first cross-county historical valuation estimates quantifying redlining’s economic effects; open-source a complete county-level valuation dataset; and demonstrate strong generalization, with <8% MAE on unseen counties.

📝 Abstract

Despite the well-documented consequences of the U.S. government's 1930s housing policies on racial wealth disparities, scholars have struggled to quantify their precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned documents are not available, we show how a regression model based on building feature data can estimate the historical values, and test the generalizability of this model to other counties. With these cost-effective tools, scholars, community activists, and policy makers can better analyze and understand the historical impacts of redlining.
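
To make the regression idea in the abstract concrete, here is a minimal sketch of fitting a value model on one county and scoring it on a held-out county. It is not the authors' model: the abstract does not say which regressor or features they use, so ordinary least squares stands in, and the feature names (floor area, room count, building age), the county labels, and every number below are hypothetical, synthetic values chosen only for illustration.

```python
from statistics import mean

def fit_ols(rows, targets):
    """Fit ordinary least squares with an intercept via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, row):
    """Apply the fitted intercept and slopes to one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def mean_abs_pct_error(coefs, rows, targets):
    """Average of |prediction - truth| / truth over a set of properties."""
    return mean(abs(predict(coefs, r) - y) / y for r, y in zip(rows, targets))

# Hypothetical training county: (floor area in sq ft, rooms, building age).
# Synthetic values follow 500 + 2*area + 100*rooms - 10*age exactly.
county_a_features = [(800, 4, 10), (1200, 6, 5), (1000, 5, 20),
                     (1500, 7, 2), (900, 5, 30), (1100, 4, 15)]
county_a_values = [2400, 3450, 2800, 4180, 2500, 2950]

# Hypothetical held-out county whose values deviate slightly from that rule.
county_b_features = [(950, 5, 12), (1300, 6, 8), (700, 3, 25)]
county_b_values = [2850, 3500, 2000]

coefs = fit_ols(county_a_features, county_a_values)
holdout_error = mean_abs_pct_error(coefs, county_b_features, county_b_values)
print("coefficients:", [round(c, 3) for c in coefs])
print("held-out county error:", round(holdout_error, 4))
```

Scoring only on a county that contributed no training rows is what makes the error a measure of transfer rather than of fit. With real data, the same split could be repeated with each county held out in turn.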

Problem

Research questions and friction points this paper is trying to address.

Digitizing inaccessible historical property appraisal records

Overcoming physical format barriers for quantitative analysis

Estimating historical values using OCR and regression models

Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR and machine learning digitize historical records

Two-stage approach combines CV and deep learning

Regression model estimates values without scans
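
The two-stage idea listed above (classical computer vision first, deep learning OCR second) can be illustrated with a small sketch of the first stage. The source does not name the classical techniques used, so this example assumes one common option, a horizontal projection profile, to locate text bands on an already binarized card. The toy `card` array and all function names are hypothetical, and the second stage (a learned OCR model reading each crop) is deliberately left out.

```python
def row_profile(binary_image):
    """Count ink pixels (value 1) in each row of a binarized image."""
    return [sum(row) for row in binary_image]

def find_text_bands(binary_image, min_ink=1):
    """Return (start, end) row ranges, end exclusive, that contain ink."""
    bands, start = [], None
    for y, ink in enumerate(row_profile(binary_image)):
        if ink >= min_ink and start is None:
            start = y                      # a band of text begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(binary_image)))
    return bands

def crop_bands(binary_image, bands):
    """Cut out each band; in a full pipeline these crops would go to OCR."""
    return [binary_image[s:e] for s, e in bands]

# Toy 8x6 "scan": 1 = ink, 0 = background (hypothetical data).
card = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

bands = find_text_bands(card)
crops = crop_bands(card, bands)
print(bands)  # [(1, 3), (5, 6), (7, 8)]
```

Raising `min_ink` is one simple way to ignore rows that contain only speckle noise, at the cost of possibly clipping faint handwriting.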
