🤖 AI Summary
Existing systems for generating programming feedback lack holistic evaluation across quality, cost, latency, and data privacy, all critical dimensions for real-world pedagogical deployment.
Method: We propose the first systematic evaluation framework for programming feedback models along these four axes. The approach integrates WebLLM with 4-bit quantized Llama3-8B and Phi3-3.8B models in a lightweight browser-based system, and uses GPT-4-synthesized data for efficient fine-tuning tailored to in-browser execution.
Contribution/Results: Evaluated on three Python programming benchmarks (CodeHelp, DS100, and PyBench), the fine-tuned small models achieve teacher-level feedback quality (mean accuracy above 89%), end-to-end latency under 2 seconds, no server dependency, and no upload of raw student code, providing strong privacy guarantees without sacrificing quality or efficiency. This demonstrates a viable path toward high-quality, low-cost, low-latency, and privacy-preserving AI-assisted programming education.
📝 Abstract
Generative AI and large language models hold great promise for enhancing programming education by generating individualized feedback and hints for learners. Recent work has primarily focused on improving the quality of generated feedback to match that of human tutors. While quality is an important performance criterion, it is not the only one to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference, which allows these models to run directly in the browser, thereby providing direct benefits in cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4-generated synthetic data. We showcase the efficacy of fine-tuned, 4-bit quantized Llama3-8B and Phi3-3.8B models using WebLLM's in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.