🤖 AI Summary
Large language models (LLMs) hallucinate and are unreliable when answering developer code questions, especially for novel queries that lack historically similar matches. Method: We construct a retrieval corpus of over 3 million Java/Python Stack Overflow posts and propose an adaptive HyDE retrieval mechanism: when historical similarity falls below a threshold, it dynamically lowers the retrieval threshold and jointly embeds hypothetical documents with the full answer context, improving contextual recall and generalization to unseen questions. Integrated into a RAG framework, our approach combines vector retrieval with LLM-as-a-judge evaluation and conducts zero-shot comparative experiments across multiple open-source LLMs. Contribution/Results: The method significantly improves answer helpfulness, correctness, and detail, demonstrates robust cross-model performance, and markedly enhances support for novel development questions.
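The adaptive retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` and `hypothetical_answer` functions are hypothetical stand-ins (a hashed bag-of-words vector and a templated string) for the real embedding model and the LLM-drafted HyDE document, and the threshold/decay values are assumed, not taken from the paper.

```python
import math

DIM = 64  # toy embedding dimensionality

def embed(text: str) -> list[float]:
    # Stand-in embedding: hashed bag-of-words, L2-normalized.
    # A real pipeline would use a sentence-embedding model here.
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[hash(tok) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def hypothetical_answer(question: str) -> str:
    # Stand-in for an LLM call that drafts a hypothetical answer (HyDE).
    return f"A possible answer to: {question}"

def adaptive_hyde_retrieve(question, corpus, threshold=0.6, floor=0.2, decay=0.5):
    """Jointly embed the question with its hypothetical answer, then retrieve.

    If no document clears the similarity threshold, relax the threshold
    (down to `floor`) so novel questions can still pick up partially
    relevant context. Returns (hits, threshold_used).
    """
    query_vec = embed(question + " " + hypothetical_answer(question))
    scored = sorted(((cosine(query_vec, embed(doc)), doc) for doc in corpus),
                    reverse=True)
    while threshold >= floor:
        hits = [doc for score, doc in scored if score >= threshold]
        if hits:
            return hits, threshold
        threshold *= decay  # lower the cutoff for unseen questions
    return [], floor
```

For a question that closely matches a corpus post, retrieval succeeds at the initial threshold; for an unrelated question, the loop decays the threshold instead of returning nothing at the first miss.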
📝 Abstract
Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java- and Python-related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that a RAG pipeline combining hypothetical document embedding (HyDE) with the full-answer context performs best in retrieving similar content and answering Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries, including both previously seen and novel questions, across different LLMs.
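The RAG-versus-zero-shot comparison with LLM-as-a-judge can be sketched as follows. This is a hedged illustration only: the `judge` function below is a crude heuristic stand-in for a real LLM judge, and the three criteria names are taken from the abstract (helpfulness, correctness, detail); everything else is assumed.

```python
from statistics import mean

CRITERIA = ("helpfulness", "correctness", "detail")

def judge(question: str, answer: str) -> dict:
    # Stand-in for an LLM-as-a-judge call returning 1-5 scores per criterion.
    # Heuristic: empty answers score low; term overlap with the question
    # nudges the score up. A real judge would be a prompted LLM.
    overlap = len(set(question.lower().split()) & set(answer.lower().split()))
    score = min(5, (3 if answer else 1) + min(2, overlap))
    return {c: score for c in CRITERIA}

def compare(question: str, rag_answer: str, zero_shot_answer: str) -> dict:
    # Per-question score delta: positive values favor the RAG answer.
    rag, zs = judge(question, rag_answer), judge(question, zero_shot_answer)
    return {c: rag[c] - zs[c] for c in CRITERIA}

def evaluate(pairs) -> dict:
    # pairs: iterable of (question, rag_answer, zero_shot_answer).
    # Returns the mean RAG-minus-zero-shot delta per criterion.
    deltas = {c: [] for c in CRITERIA}
    for question, rag_answer, zs_answer in pairs:
        diff = compare(question, rag_answer, zs_answer)
        for c in CRITERIA:
            deltas[c].append(diff[c])
    return {c: mean(v) for c, v in deltas.items()}
```

A positive mean delta on a criterion indicates the RAG pipeline's answers were judged better than the zero-shot answers on that criterion, mirroring the per-criterion comparison reported in the paper.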