Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional pipeline-based entity resolution (graph sparsity, missing edges, and error propagation) by introducing Alper, a framework that unifies matching and clustering into a single co-optimization task. Alper iteratively performs probabilistic label propagation over a global dynamic graph, combining weak but cheap signals from graph propagation with strong but expensive pairwise queries to large language models (LLMs). It further incorporates an adaptive graph refinement mechanism and uses a budget-constrained greedy algorithm to select high-value LLM queries, with provable theoretical guarantees. Experiments on eight benchmark datasets show that Alper consistently outperforms state-of-the-art cascaded pipelines.
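To make the idea of mixing cheap graph signals with expensive LLM verdicts concrete, here is a minimal sketch. It is not Alper's algorithm: it uses plain weighted-majority label propagation rather than the probabilistic formulation the paper describes, and the names `merge_signals` and `propagate`, along with the rule that an LLM verdict overrides an edge weight with 1.0 or 0.0, are assumptions made purely for illustration.

```python
def merge_signals(weak_edges, llm_answers):
    """Overlay strong LLM verdicts on weak similarity scores (hypothetical rule).

    weak_edges:  dict (u, v) -> similarity in [0, 1]   (cheap signal)
    llm_answers: dict (u, v) -> bool                   (expensive signal)
    """
    # Normalize pair orientation so (u, v) and (v, u) refer to the same edge.
    edges = {tuple(sorted(pair)): w for pair, w in weak_edges.items()}
    for pair, is_match in llm_answers.items():
        # A verdict can strengthen, cut, or add an edge missing from the graph.
        edges[tuple(sorted(pair))] = 1.0 if is_match else 0.0
    return edges


def propagate(nodes, edges, max_rounds=50):
    """Weighted-majority label propagation; returns node -> cluster label."""
    labels = {n: n for n in nodes}  # every record starts in its own cluster
    nbrs = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        if w > 0.0:
            nbrs[u].append((v, w))
            nbrs[v].append((u, w))
    for _ in range(max_rounds):
        changed = False
        for n in sorted(nodes):  # fixed order keeps the run deterministic
            votes = {}
            for m, w in nbrs[n]:
                votes[labels[m]] = votes.get(labels[m], 0.0) + w
            if not votes:
                continue  # isolated record keeps its own label
            best = max(sorted(votes), key=votes.get)
            if best != labels[n]:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels


nodes = ["a", "b", "c", "d"]
weak = {("a", "b"): 0.9, ("b", "c"): 0.6, ("c", "d"): 0.8}

merged = propagate(nodes, merge_signals(weak, {("b", "c"): True}))
split = propagate(nodes, merge_signals(weak, {("c", "b"): False}))
```

In this toy graph, a single LLM verdict on the ambiguous pair (b, c) decides whether the four records end up in one cluster or two, which is the kind of high-leverage query a budgeted selection strategy would aim to find.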
📝 Abstract
Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
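The budgeted query selection described in the abstract can be illustrated with a small greedy loop. This is a sketch under stated assumptions, not the paper's objective: the gain function here (binary entropy of an estimated match probability, discounted when a record is already covered by earlier picks) and the unit cost per query are hypothetical stand-ins, and the names `marginal_gain` and `greedy_select` are invented for this example.

```python
import math


def binary_entropy(p):
    """Uncertainty, in bits, of a match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def marginal_gain(pair, prob, touched):
    """Hypothetical gain: pair uncertainty, discounted when its endpoints
    are already covered by previously selected queries."""
    u, v = pair
    discount = 1.0 / (1 + touched.get(u, 0) + touched.get(v, 0))
    return binary_entropy(prob) * discount


def greedy_select(candidates, budget):
    """Pick up to `budget` pairs, one at a time, by largest marginal gain.

    candidates: dict (u, v) -> estimated match probability.
    Each query is assumed to cost one unit of budget.
    """
    remaining = dict(candidates)
    touched = {}   # record -> number of selected queries involving it
    selected = []
    while remaining and len(selected) < budget:
        best = max(remaining,
                   key=lambda pr: marginal_gain(pr, remaining[pr], touched))
        if marginal_gain(best, remaining[best], touched) <= 0.0:
            break  # nothing left that is worth a query
        selected.append(best)
        del remaining[best]
        for node in best:
            touched[node] = touched.get(node, 0) + 1
    return selected


candidates = {
    ("a", "b"): 0.50,  # maximally uncertain
    ("a", "c"): 0.55,  # uncertain, but shares record "a" with the pair above
    ("d", "e"): 0.70,
    ("f", "g"): 0.99,  # nearly certain, low value to query
}
picked = greedy_select(candidates, budget=2)
```

Because gains are re-evaluated after every pick, the second query goes to (d, e) rather than (a, c): once (a, b) is selected, another query touching record "a" is worth less under this assumed discount, even though its raw uncertainty is higher.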
Problem

Research questions and friction points this paper is trying to address.

Entity Resolution
Dirty Data
Graph Construction
Error Propagation
Clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity Resolution
Label Propagation
Graph Refinement
Large Language Models
Cost-Effective Optimization