The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether JavaScript code generated by large language models (LLMs) exhibits model-specific, identifiable "fingerprints" that enable fine-grained model attribution. To this end, the authors construct LLM-NodeJS, a large-scale dataset comprising 250,000 samples, each also provided in two structured representations: a JavaScript Intermediate Representation (JSIR) and an Abstract Syntax Tree (AST). They design CodeT5-JSA, a Transformer-based classifier benchmarked against both traditional machine learning and Transformer baselines. The empirical study is the first to demonstrate that LLM-generated JavaScript exhibits stable, robust structural biases, discernible even across models within the same family or after code obfuscation and transformation. In five-class, ten-class, and twenty-class attribution tasks, CodeT5-JSA achieves accuracies of 95.8%, 94.6%, and 88.5%, respectively, significantly outperforming baselines including BERT and CodeBERT. This advances AI-generated code detection beyond the conventional binary "human vs. machine" paradigm toward precise model-level attribution.

📝 Abstract
In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution plays a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each has four transformed variants, yielding 250,000 unique JavaScript samples, plus two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
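The CodeT5-JSA design described in the abstract (the CodeT5 encoder retained, the decoder removed, and a classification head attached) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' released code: the tiny random-weight `T5Config` stands in for the real 770M-parameter CodeT5 checkpoint, and the mean-pooling step is an assumption.

```python
import torch
from transformers import T5Config, T5EncoderModel

class CodeT5JSASketch(torch.nn.Module):
    """Encoder-only T5 with a linear head for N-way model attribution."""
    def __init__(self, config: T5Config, num_models: int):
        super().__init__()
        # T5EncoderModel never builds a decoder, mirroring "decoder removed".
        self.encoder = T5EncoderModel(config)
        self.head = torch.nn.Linear(config.d_model, num_models)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding positions (pooling choice is assumed).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)

# Tiny random-weight config for demonstration only; the paper fine-tunes
# the 770M-parameter CodeT5 instead.
cfg = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
               num_layers=2, num_heads=4)
model = CodeT5JSASketch(cfg, num_models=5).eval()
ids = torch.randint(0, 100, (2, 16))
mask = torch.ones(2, 16, dtype=torch.long)
with torch.no_grad():
    logits = model(ids, mask)   # shape: (2, 5), one score per candidate LLM
```

For five-, ten-, or twenty-class attribution, only `num_models` changes.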
Problem

Research questions and friction points this paper is trying to address.

Identifying which LLM generated JavaScript code through authorship attribution
Detecting unique stylistic signatures in AI-generated code across different models
Enabling reliable model fingerprinting despite code transformations and obfuscation
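As a toy illustration of the transformed variants mentioned above (not the paper's actual pipeline), two common transformations, comment removal and identifier mangling, can be sketched in a few lines. The regexes are deliberately naive and would mishandle strings containing comment-like text.

```python
import re

KEYWORDS = {"function", "return", "const", "let", "var",
            "if", "else", "for", "while", "true", "false", "null"}

def strip_comments(js: str) -> str:
    """Drop /* block */ and // line comments (naive: would also strip
    comment-like text inside string literals)."""
    js = re.sub(r"/\*.*?\*/", "", js, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", js)

def mangle_identifiers(js: str) -> str:
    """Rename every non-keyword identifier to v0, v1, ... in order seen."""
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name in KEYWORDS:
            return name
        return mapping.setdefault(name, f"v{len(mapping)}")
    return re.sub(r"\b[A-Za-z_]\w*\b", rename, js)

src = "// adds two numbers\nfunction add(first, second) { return first + second; }"
out = mangle_identifiers(strip_comments(src)).strip()
# out == "function v0(v1, v2) { return v1 + v2; }"
```

Attribution surviving such rewrites is what motivates looking past surface features.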
Innovation

Methods, ideas, or system contributions that make the work stand out.

Custom CodeT5-JSA architecture for code attribution
Dataset with transformed variants and multiple representations
Captures structural patterns beyond surface-level features
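The claim that structural patterns survive surface-level changes can be illustrated with a stdlib-only toy (not the paper's method): normalize identifiers and literals away so only token structure remains, then attribute by nearest-centroid cosine similarity over token bigrams. The two "model" styles below are invented for the demo.

```python
import re
from collections import Counter
from math import sqrt

KEYWORDS = {"function", "const", "let", "var", "return", "if", "else",
            "for", "while", "async", "await"}
TOKEN_RE = re.compile(
    r"=>|===|!==|[A-Za-z_$][\w$]*|\d+(?:\.\d+)?"
    r"|\"[^\"]*\"|'[^']*'|`[^`]*`|[{}()\[\];,.=+\-*/<>!]"
)

def normalize(tok: str) -> str:
    """Keep keywords/punctuation literally; erase names and literals."""
    if tok in KEYWORDS:
        return tok
    c = tok[0]
    if c in "\"'`":
        return "STR"
    if c.isdigit():
        return "NUM"
    if c.isalpha() or c in "_$":
        return "ID"
    return tok

def structural_profile(js: str) -> Counter:
    """Bigram counts over normalized tokens: a crude structural fingerprint."""
    toks = [normalize(m.group(0)) for m in TOKEN_RE.finditer(js)]
    return Counter(zip(toks, toks[1:]))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(sample: str, profiles: dict) -> str:
    """Nearest-centroid attribution over per-model structural profiles."""
    p = structural_profile(sample)
    return max(profiles, key=lambda m: cosine(p, profiles[m]))

# Two invented "model" styles: arrow functions vs. classic functions.
profiles = {
    "model_A": structural_profile(
        "const add = (a, b) => a + b;\nconst mul = (a, b) => a * b;"),
    "model_B": structural_profile(
        "function add(a, b) { return a + b; }\n"
        "function mul(a, b) { return a * b; }"),
}
guess = attribute("const sub = (a, b) => a - b;", profiles)  # → "model_A"
```

Because identifiers and literals are normalized away, renaming variables does not change the profile; that is the intuition behind transformation-robust attribution, here in miniature.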
Norbert Tihanyi
Technology Innovation Institute, Abu Dhabi, United Arab Emirates
Bilel Cherif
Technology Innovation Institute, Abu Dhabi, United Arab Emirates
Richard A. Dubniczky
PhD Student, ELTE
Cybersecurity, Cryptography, AI, Web Services
Mohamed Amine Ferrag
Associate Professor of AI & LLM & Networking & Cybersecurity
Cybersecurity, AI/ML/DL, Testbed and Dataset, IoT, Large Language Models
Tamás Bisztray
University of Oslo, Oslo, Norway