Interactive Data Harmonization with LLM Agents

📅 2025-02-10
🤖 AI Summary
Problem: Standardizing heterogeneous clinical data from multiple sources remains challenging due to schema misalignment, terminological heterogeneity, and variability in data collection practices. Method: This paper proposes an interactive data harmonization paradigm powered by LLM-based agents that combines domain expert knowledge with large language model reasoning. Through an interactive UI, users progressively construct harmonization pipelines, supported by core components including schema mapping, semantic alignment, and a standardized primitive library. Contribution/Results: Unlike end-to-end black-box approaches, the paper introduces the first human-in-the-loop mechanism for constructing pipelines that are generated incrementally and reusable on demand. Experiments demonstrate a ~70% reduction in manual coding effort, improved harmonization consistency and reproducibility, and effectiveness and generalizability on real-world clinical datasets.

📝 Abstract
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
Problem

Research questions and friction points this paper is trying to address.

Automating data harmonization processes
Integrating datasets from diverse sources
Creating reusable data mapping pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based reasoning integration
Interactive user interface design
Data harmonization primitives library
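
To make the idea of a reusable pipeline built from harmonization primitives more concrete, here is a minimal sketch. It is not Harmonia's API: the names `rename_columns`, `map_values`, and `Pipeline`, as well as the column names and value codes, are hypothetical and chosen only for illustration. The sketch assumes a pipeline is an ordered list of small functions, each of which could be proposed by an agent and approved by an expert before being added.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]
Step = Callable[[list[Row]], list[Row]]


def rename_columns(mapping: dict[str, str]) -> Step:
    """Hypothetical schema-mapping primitive: rename source columns to target names."""
    def step(rows: list[Row]) -> list[Row]:
        return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]
    return step


def map_values(column: str, mapping: dict[object, object]) -> Step:
    """Hypothetical value-mapping primitive: translate source terms to target terms."""
    def step(rows: list[Row]) -> list[Row]:
        out = []
        for row in rows:
            new_row = dict(row)  # copy so the input rows are never mutated
            if column in new_row:
                # Unknown values pass through unchanged so an expert can review them.
                new_row[column] = mapping.get(new_row[column], new_row[column])
            out.append(new_row)
        return out
    return step


@dataclass
class Pipeline:
    """An ordered, reusable list of harmonization steps."""
    steps: list[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, rows: list[Row]) -> list[Row]:
        for step in self.steps:
            rows = step(rows)
        return rows


# Build the pipeline once, step by step (assumed source and target formats).
pipeline = (
    Pipeline()
    .add(rename_columns({"Gender": "sex", "Age_yrs": "age"}))
    .add(map_values("sex", {"M": "male", "F": "female"}))
)

# Reuse the same pipeline on two datasets that share the source schema.
site_a = [{"Gender": "M", "Age_yrs": 61}, {"Gender": "F", "Age_yrs": 47}]
site_b = [{"Gender": "F", "Age_yrs": 70}]

harmonized_a = pipeline.run(site_a)
harmonized_b = pipeline.run(site_b)
print(harmonized_a)
print(harmonized_b)
```

Keeping each primitive a plain function makes the pipeline easy to inspect, reorder, and rerun, which is one plausible way to get the reusability the abstract describes.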