🤖 AI Summary
To address privacy and transparency risks in LLM-driven scientific tools, particularly the unintended leakage of confidential data such as intellectual property and proprietary datasets, this paper introduces DataShield: the first unified framework integrating privacy leak detection, interpretable privacy policy analysis, and interactive data lineage visualization. Methodologically, it combines rule-based detection engines, a lightweight NER model, privacy policy summarization, and dynamic lineage graph rendering. Evaluated on real-world scientific toolchains, DataShield achieves 92% accuracy in identifying sensitive data leaks. A user study with domain scientists shows that 87% report significantly improved awareness of privacy risks and greater confidence in data-handling decisions. This work is the first to jointly address policy alignment, scientist-centered decision support, and regulatory compliance within LLM-based research infrastructure, establishing a practical, deployable pathway toward trustworthy scientific AI.
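The summary does not include an implementation, so the sketch below is only an illustration of how a rule-based engine and a lightweight NER model might be combined into a pre-submission leak check of the kind described. It uses regular expressions plus spaCy's off-the-shelf entity recognizer; the pattern set, entity labels, and names such as `RULES` and `scan_prompt` are hypothetical and not drawn from DataShield itself.

```python
# Hypothetical sketch (not the paper's code): flag potentially confidential spans
# in a prompt before it is forwarded to an LLM-powered scientific tool.
import re
from dataclasses import dataclass

import spacy  # assumes: python -m spacy download en_core_web_sm

# Rule-based patterns for structured secrets; illustrative only, not DataShield's rule set.
RULES = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "api_key": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
    "doi": re.compile(r"\b10\.\d{4,9}/\S+\b"),  # unpublished DOIs may point at proprietary datasets
}

# NER labels treated as potentially sensitive in a research context (an assumption).
SENSITIVE_LABELS = {"PERSON", "ORG", "GPE"}

@dataclass
class Finding:
    kind: str   # rule name or NER label
    span: str   # offending text
    start: int  # character offsets in the prompt
    end: int

def scan_prompt(text: str, nlp) -> list[Finding]:
    """Return rule-based and NER findings so the scientist can review them before submission."""
    findings = [
        Finding(name, m.group(), m.start(), m.end())
        for name, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]
    doc = nlp(text)
    findings += [
        Finding(ent.label_, ent.text, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ in SENSITIVE_LABELS
    ]
    return sorted(findings, key=lambda f: f.start)

if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    prompt = "Share the assay results with jane.doe@lab.example before the patent filing."
    for f in scan_prompt(prompt, nlp):
        print(f"[{f.kind}] '{f.span}' at {f.start}-{f.end}")
```

In a full system along the lines the summary describes, such findings would also feed a lineage graph and be checked against summarized organizational policy, rather than only being printed for review.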
📝 Abstract
As Large Language Models (LLMs) become integral to scientific workflows, concerns over the confidentiality and ethical handling of sensitive data have emerged. This paper explores, from scientists' perspectives, the data exposure risks posed by LLM-powered scientific tools, which can inadvertently leak confidential information, including intellectual property and proprietary data. We propose "DataShield", a framework designed to detect confidential data leaks, summarize privacy policies, and visualize data flow, ensuring alignment with organizational policies and procedures. Our approach aims to inform scientists about data-handling practices, enabling them to make informed decisions and protect sensitive information. User studies with scientists are underway to evaluate the framework's usability, trustworthiness, and effectiveness in tackling real-world privacy challenges.