Interactive Data Harmonization with LLM Agents

📅 2025-02-10

🤖 AI Summary

Problem: Standardizing heterogeneous clinical data from multiple sources remains challenging because of schema misalignment, terminological heterogeneity, and variability in data collection practices. Method: This paper proposes an interactive data harmonization paradigm powered by LLM-based agents that combines domain expert knowledge with large language model reasoning. Through an interactive UI, users progressively construct harmonization pipelines, supported by core components including schema mapping, semantic alignment, and a standardized primitive library. Contribution/Results: Unlike end-to-end black-box approaches, the work introduces the first human-in-the-loop mechanism for building pipelines incrementally and reusing them on demand. Experiments demonstrate a ~70% reduction in manual coding effort and improved harmonization consistency and reproducibility, with effectiveness and generalizability validated on real-world clinical datasets.
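To make the schema mapping and human-in-the-loop ideas concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names (`suggest_mappings`, `apply_review`), the column names, and the 0.6 threshold are all hypothetical, and plain string similarity from the standard library stands in for the LLM reasoning the summary describes.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop separators, so 'Patient_ID' becomes 'patientid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mappings(source_cols, target_cols, threshold=0.6):
    """Return {source: (best_target, score)} for matches at or above the threshold.

    String similarity is an assumption made for this sketch; it is a stand-in
    for the LLM-based matching described in the summary.
    """
    suggestions = {}
    for src in source_cols:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_cols
        ]
        score, best = max(scored)
        if score >= threshold:
            suggestions[src] = (best, round(score, 2))
    return suggestions


def apply_review(suggestions, overrides):
    """Merge expert overrides on top of the automatic suggestions."""
    mapping = {src: tgt for src, (tgt, _) in suggestions.items()}
    mapping.update(overrides)
    return mapping


# Hypothetical source and target schemas.
source = ["Patient_ID", "AgeYears", "sex", "dx"]
target = ["patient_id", "age_years", "gender", "diagnosis"]

suggestions = suggest_mappings(source, target)
# The expert fills in the columns that the automatic step left unmatched.
final_mapping = apply_review(suggestions, {"sex": "gender", "dx": "diagnosis"})
```

In this sketch the automatic step only resolves near-identical names, and the semantically related pairs (`sex`/`gender`, `dx`/`diagnosis`) are left to the reviewer. That gap is the kind of case where terminology-aware reasoning and expert input are meant to help.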

📝 Abstract

Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means both to empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
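The idea of composing harmonization primitives into a reusable pipeline can be sketched as follows. This is an illustrative example only, not Harmonia's API: the `Pipeline` class, the `rename_columns` and `map_values` primitives, and the sample records are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]
Primitive = Callable[[Record], Record]


def rename_columns(mapping: Dict[str, str]) -> Primitive:
    """Hypothetical primitive: rename keys according to a schema mapping."""
    def step(record: Record) -> Record:
        return {mapping.get(key, key): value for key, value in record.items()}
    return step


def map_values(column: str, value_map: Dict[object, object]) -> Primitive:
    """Hypothetical primitive: translate the values of one column to standard terms."""
    def step(record: Record) -> Record:
        out = dict(record)  # copy so the input record is left untouched
        if column in out:
            # Values missing from value_map pass through unchanged.
            out[column] = value_map.get(out[column], out[column])
        return out
    return step


@dataclass
class Pipeline:
    """An ordered list of primitives that can be rerun on new datasets."""
    steps: List[Primitive] = field(default_factory=list)

    def add(self, step: Primitive) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, records: List[Record]) -> List[Record]:
        result = []
        for record in records:
            for step in self.steps:
                record = step(record)
            result.append(record)
        return result


# Build the pipeline once, one step at a time.
pipeline = (
    Pipeline()
    .add(rename_columns({"sex": "gender", "AgeYears": "age_years"}))
    .add(map_values("gender", {"M": "male", "F": "female"}))
)

# Reuse the same pipeline on two hypothetical datasets.
site_a = [{"sex": "M", "AgeYears": 54}]
site_b = [{"sex": "F", "AgeYears": 61}, {"sex": "unknown", "AgeYears": 47}]
harmonized_a = pipeline.run(site_a)
harmonized_b = pipeline.run(site_b)
```

Keeping each primitive a pure function from record to record is a design choice made for the sketch: steps can be added incrementally, inspected individually, and replayed on another dataset without modifying the inputs.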

Problem

Research questions and friction points this paper is trying to address.

Automating data harmonization processes

Integrating datasets from diverse sources

Creating reusable data mapping pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based reasoning integration

Interactive user interface design

Data harmonization primitives library

🔎 Similar Papers

CleanAgent: Automating Data Standardization with LLM-based Agents