Are Sparse Autoencoders Useful for Java Function Bug Detection?

📅 2025-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of function-level vulnerabilities in Java, such as buffer overflows and SQL injection, without fine-tuning or labeled supervision. We propose a lightweight, interpretable method that uses sparse autoencoders (SAEs) to extract vulnerability-specific features, without supervision, directly from frozen intermediate-layer representations of pretrained LLMs (GPT-2 Small and Gemma 2B). To our knowledge, this is the first empirical demonstration that SAEs can identify software defects from internal LLM activations, departing from conventional fine-tuning paradigms. Evaluated as binary classification over static code embeddings, the approach achieves an F1 score of up to 89% on function-level vulnerability detection, consistently outperforming fine-tuned Transformer encoder baselines. It also enables neuron-level attribution and supports interpretable analysis of model reasoning.
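For readers less familiar with the reported metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows how it is computed for binary bug labels; the function name `f1_score` and the toy labels are illustrative assumptions, not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1, where label 1 means "buggy" and 0 means "not buggy"."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bugs correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean functions flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bugs missed
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): one true positive, one false positive, one miss.
score = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
```

Because F1 ignores true negatives, it reflects how well the buggy class itself is recovered, which matters when false positives are a stated concern.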

📝 Abstract

Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We find that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision.
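To make the idea of SAE-derived features concrete, here is a minimal pure-Python sketch of one common SAE encoder form, `ReLU(W_enc (x - b_dec) + b_enc)`, applied to per-token activations and then mean-pooled per function. The weights, the pooling step, and the ranking of features by class-mean difference are assumptions for illustration only; the abstract does not specify the authors' exact pipeline.

```python
def sae_encode(activation, w_enc, b_enc, b_dec):
    """Encode one activation vector into sparse, non-negative SAE features."""
    centered = [a - b for a, b in zip(activation, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(w_enc, b_enc)]
    return [max(0.0, x) for x in pre]  # ReLU keeps only active features


def mean_pool(token_features):
    """Average per-token feature vectors into one vector per function."""
    n = len(token_features)
    return [sum(col) / n for col in zip(*token_features)]


def rank_features(buggy, clean):
    """Order feature indices by |mean(buggy) - mean(clean)|, largest first."""
    diff = [abs(b - c) for b, c in zip(mean_pool(buggy), mean_pool(clean))]
    return sorted(range(len(diff)), key=lambda i: diff[i], reverse=True)


# Hypothetical toy SAE: 2-dimensional activations, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

features = sae_encode([2.0, -1.0], W_ENC, B_ENC, B_DEC)
pooled = mean_pool([[2.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
ranking = rank_features(
    buggy=[[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    clean=[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
)
```

In a real setting the activations would come from a frozen layer of GPT-2 Small or Gemma 2B and the SAE weights from a trained dictionary; the highest-ranked features could then feed a small downstream classifier while the LLM itself stays untouched.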

Problem

Research questions and friction points this paper is trying to address.

Detect Java function bugs using Sparse Autoencoders

Reduce false positives in vulnerability detection methods

Improve interpretability of AI-based bug detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders for Java bug detection

Lightweight interpretable alternative to LLMs

Bug detection from pretrained LLMs without fine-tuning
