Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether references generated by large language models (LLMs) can be distinguished from human-curated citations at both the structural and the semantic level. The authors construct paired citation graphs and analyze them with graph neural networks (GNNs), combining topological node features such as centrality with high-dimensional semantic embeddings (SPECTER and OpenAI). Their findings reveal that LLM-generated citations closely mimic human references in graph structure, yielding only about 60% detection accuracy from structural features alone. Once semantic embeddings are introduced, however, detectable discrepancies emerge, boosting GNN-based detection accuracy to 93%. The approach is robust across multiple LLMs, including Claude Sonnet 4.5. These results underscore that semantic content signals are more discriminative than global graph topology, suggesting that detection and debiasing of LLM-generated bibliographies should target content rather than structure.
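The structure-only signals mentioned above (node centrality, clustering) can be sketched on a toy citation graph with `networkx`. This is a minimal illustration, not the paper's pipeline: the graph, node names, and feature choices here are assumptions for demonstration.

```python
import networkx as nx

# Toy directed citation graph: a focal paper "p0" cites three references,
# two of which also cite each other. Purely illustrative data.
G = nx.DiGraph([("p0", "r1"), ("p0", "r2"), ("p0", "r3"),
                ("r1", "r2"), ("r2", "r3")])

und = G.to_undirected()
eig = nx.eigenvector_centrality(und, max_iter=1000)

# Structure-only node features of the kind the paper aggregates:
# degree, closeness/eigenvector centrality, and local clustering.
features = {
    n: {
        "degree": G.degree(n),
        "closeness": nx.closeness_centrality(G, n),
        "eigenvector": eig[n],
        "clustering": nx.clustering(und, n),
    }
    for n in G
}

# "r2" is cited by p0 and r1 and cites r3, so its total degree is 3.
print(features["r2"]["degree"])
```

Per-graph aggregates of such feature dictionaries (e.g. means and maxima over nodes) are the kind of input a graph-level classifier would consume.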

📝 Abstract
Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89--0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93\% test accuracy on GPT vs.\ ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs.\ Claude $\approx 0.77$ and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.
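The RF-on-graph-level-aggregates step described in the abstract can be sketched as follows. The data here is synthetic: we draw two classes of hypothetical per-graph feature vectors (standing in for pooled centralities or embeddings) with a small mean shift, purely to show the classifier setup; the real study computes these aggregates from paired human and LLM citation graphs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-graph aggregates: 400 "human" and 400 "LLM" graphs,
# 16 features each, with a synthetic mean shift standing in for the
# semantic discrepancies the paper reports.
n_graphs, n_feats = 400, 16
X_human = rng.normal(0.0, 1.0, (n_graphs, n_feats))
X_llm = rng.normal(0.3, 1.0, (n_graphs, n_feats))
X = np.vstack([X_human, X_llm])
y = np.array([0] * n_graphs + [1] * n_graphs)  # 0 = human, 1 = LLM

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```

The same template applies to both feature regimes the paper compares: with structure-only aggregates the two classes are nearly inseparable, while embedding-derived aggregates carry the discriminative signal.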
Problem

Research questions and friction points this paper is trying to address.

LLM-generated references
citation graphs
semantic bias
bibliography curation
detectability
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated references
semantic embeddings
Graph Neural Networks
citation graph analysis
detectable semantic bias
Melika Mobini
Vrije Universiteit Brussel
Vincent Holst
Vrije Universiteit Brussel
Floriano Tori
Vrije Universiteit Brussel
A. Algaba
Vrije Universiteit Brussel
Vincent Ginis
Vrije Universiteit Brussel / Harvard University
Physics | Machine Learning