AI Summary
In electronic design automation (EDA), the scarcity of high-quality annotated data severely limits the accuracy and domain expertise of open-source large language models (LLMs) in retrieval-augmented generation (RAG). To address this, we propose RAFT, an EDA-oriented, synthetic-data-driven retrieval-augmented fine-tuning framework. RAFT introduces retrieval-augmented few-shot (RAFS) synthesis, the first method to generate high-fidelity question-answer pairs grounded in real user queries. It further combines fine-grained access control with model memorization analysis to enforce strict permission isolation for sensitive design data and mitigate privacy-leakage risks. Experiments demonstrate significant accuracy improvements on EDA tasks such as design verification. Crucially, synthetic data proves an effective substitute for scarce human annotations, establishing a reusable technical pathway for adapting LLMs to vertical domains.
Abstract
Electronic design engineers often struggle to efficiently access relevant information for tasks like design verification and technology development. While large language models (LLMs) can enhance productivity as conversational agents, pre-trained open-source LLMs lack domain-specific knowledge for Electronic Design Automation (EDA). In a Retrieval-Augmented Generation (RAG) context, LLMs rely on external context but may still produce inaccurate responses. Retrieval-Augmented Fine-Tuning (RAFT) improves LLM performance, but acquiring labeled question/answer (Q/A) data in EDA is difficult. To address this, we propose using synthetic Q/A datasets to enhance LLMs with RAFT. Our results show that RAFT with synthetic data significantly boosts LLM performance for RAG-based EDA tasks. We also investigate the impact of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation. Additionally, we implement secure access control to ensure sensitive information is only accessible to authorized personnel. Finally, we assess the risk of data leakage and unintended memorization during fine-tuning with synthetic data, providing practical insights.
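To make the RAFS idea concrete, here is a minimal sketch of how a real user question plus retrieved documentation could seed a synthetic Q/A pair. This is an illustration only: the helper names (`retrieve`, `llm`, `synthesize_qa`), the toy lexical retriever, and the canned LLM stub are all assumptions, not the paper's implementation; a real pipeline would use a vector retriever and an actual model call.

```python
# Hedged sketch of Retrieval-Augmented Few-Shot (RAFS) synthetic Q/A generation.
# All helper names here are hypothetical illustrations, not the paper's code.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    context: str

# Toy stand-in for an EDA documentation corpus.
DOC_CHUNKS = [
    "Design rule checking (DRC) verifies that a layout meets foundry constraints.",
    "Logic equivalence checking compares RTL against the synthesized netlist.",
    "Static timing analysis reports setup and hold violations across corners.",
]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return scored[:k]

def llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; returns a canned Q/A string.
    In practice this would invoke a hosted or locally served model."""
    return "Q: What does DRC verify? A: That a layout meets foundry constraints."

def synthesize_qa(real_user_question: str, chunks: list[str]) -> QAPair:
    """Ground a synthetic Q/A pair in retrieved context, using a real user
    question as a few-shot exemplar of the question style to imitate."""
    context = "\n".join(retrieve(real_user_question, chunks))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Example user question: {real_user_question}\n"
        "Write one new question answerable from the context, then its answer."
    )
    raw = llm(prompt)
    q, _, a = raw.partition(" A: ")
    return QAPair(
        question=q.removeprefix("Q: ").strip(),
        answer=a.strip(),
        context=context,
    )

pair = synthesize_qa("How do I check design rules?", DOC_CHUNKS)
print(pair.question)
```

The resulting `(question, answer, context)` triples can then be used for retrieval-augmented fine-tuning, so the model learns to answer from the retrieved context rather than from memorized training text.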