🤖 AI Summary
Problem: Standardizing multi-source heterogeneous clinical data remains challenging due to schema misalignment, terminological heterogeneity, and variability in data collection practices. Method: This paper proposes an interactive data harmonization paradigm powered by LLM-based agents, integrating domain expert knowledge with large language model reasoning. Through an interactive UI, users progressively construct harmonization pipelines, supported by core components including schema mapping, semantic alignment, and a standardized primitive library. Contribution/Results: Unlike end-to-end black-box approaches, this work introduces the first human-in-the-loop mechanism for constructing pipelines that are generated incrementally and reusable on demand. Experiments demonstrate a ~70% reduction in manual coding effort, significantly improved harmonization consistency and reproducibility, and empirically validated effectiveness and generalizability on real-world clinical datasets.
📝 Abstract
Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains time-consuming and challenging due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means both to empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
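To make the idea of a reusable pipeline built from harmonization primitives more concrete, the following is a minimal sketch in Python. It is not Harmonia's actual API: the names `rename_columns`, `map_values`, `Pipeline`, and `build_example_pipeline`, as well as the column names and value vocabularies, are hypothetical and chosen only for illustration. In the system described above, an LLM-based agent and a domain expert would presumably choose and parameterize such steps interactively. Here they are hard-coded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A record is one row of a tabular dataset; a step transforms a list of rows.
Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def rename_columns(mapping: Dict[str, str]) -> Step:
    """Hypothetical schema-mapping primitive: rename source columns to target names."""
    def step(rows: List[Record]) -> List[Record]:
        # Columns absent from the mapping keep their original name.
        return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]
    return step


def map_values(column: str, mapping: Dict[str, str]) -> Step:
    """Hypothetical value-mapping primitive: translate source terms to a target vocabulary."""
    def step(rows: List[Record]) -> List[Record]:
        out = []
        for row in rows:
            new_row = dict(row)  # copy so the input rows are not mutated
            if column in new_row:
                # Values absent from the mapping are passed through unchanged.
                new_row[column] = mapping.get(new_row[column], new_row[column])
            out.append(new_row)
        return out
    return step


@dataclass
class Pipeline:
    """An ordered list of steps that can be re-run on any compatible dataset."""
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, rows: List[Record]) -> List[Record]:
        for step in self.steps:
            rows = step(rows)
        return rows


def build_example_pipeline() -> Pipeline:
    # Assumed source and target schemas, for illustration only.
    return (
        Pipeline()
        .add(rename_columns({"pt_sex": "gender", "age_yrs": "age"}))
        .add(map_values("gender", {"M": "Male", "F": "Female"}))
    )


if __name__ == "__main__":
    source = [{"pt_sex": "M", "age_yrs": 61}, {"pt_sex": "F", "age_yrs": 47}]
    print(build_example_pipeline().run(source))
```

Because the pipeline is an explicit list of steps rather than a one-off transformation, the same object can be applied to a second dataset that shares the source schema, which is the sense of "reusable" assumed in this sketch.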