Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks

📅 2025-04-02

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Large language models (LLMs) pose potential risks in software engineering due to harmful outputs—e.g., insecure code, malicious scripts, or unethical advice—that threaten development safety and reliability. Method: We propose the first LLM harmfulness evaluation framework tailored to programming contexts. It comprises a software-engineering-specific harmfulness taxonomy, a verifiable automated evaluator, and a curated prompt dataset grounded in realistic coding tasks. We systematically assess over 20 open- and closed-source models—including general-purpose and code-specialized variants—across diverse programming scenarios. Contributions/Results: Key findings include: (1) models like OpenHermes exhibit significantly higher harmfulness; (2) code-specialized models are not inherently safer; (3) certain fine-tuned models increase risk relative to their base counterparts; and (4) larger parameter counts generally improve both safety and utility. The study uncovers non-intuitive relationships among model scale, architecture, and alignment strategies in determining programming-specific harmfulness, providing empirical foundations and an evaluation paradigm for LLM safety governance.

Technology Category

Application Category

📝 Abstract

Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs. Furthermore, we investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices. On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.

Problem

Research questions and friction points this paper is trying to address.

Assessing harmfulness of LLMs in software engineering tasks

Evaluating alignment strategies for harmless code generation

Comparing harmful content tendencies across different LLM types

Innovation

Methods, ideas, or system contributions that make the work stand out.

Develop taxonomy for harmful software scenarios

Create dataset of prompts for LLM assessment

Design automatic evaluator for LLM outputs

🔎 Similar Papers

Do Large Code Models Understand Programming Concepts? Counterfactual Analysis for Code Predicates