LLM DNA: Tracing Model Evolution via Functional Representations

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
The rapid proliferation of large language models (LLMs) has outpaced systematic tracking of their evolutionary relationships (e.g., via fine-tuning, distillation, or adaptation), leaving lineages unrecorded and non-reproducible. Existing analysis methods suffer from task specificity, reliance on fixed model sets, or assumptions about architecture or tokenization. Method: We propose LLM DNA, a low-dimensional, bi-Lipschitz representation of functional behavior that defines the first heritable, genetically deterministic behavioral fingerprint for LLMs, agnostic to architecture and tokenizer. Our training-free DNA extraction pipeline, combined with phylogenetic inference, constructs the first comprehensive LLM phylogeny, spanning 305 models. Contribution/Results: Experiments show that LLM DNA accurately recovers known lineage relationships, discovers previously unknown evolutionary paths, and quantitatively captures architectural shifts, temporal trends, and divergent family-level evolutionary rates.

📝 Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
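The bi-Lipschitz property mentioned in the abstract can be read as a two-sided distance guarantee. One plausible formalization (the symbols ρ, c, and C are ours, not necessarily the paper's notation):

```latex
c \,\rho(f, g) \;\le\; \left\| \mathrm{DNA}(f) - \mathrm{DNA}(g) \right\| \;\le\; C \,\rho(f, g)
\qquad \text{for all models } f, g,\; 0 < c \le C,
```

where ρ is a distance on functional behaviors. Intuitively, DNA distances neither collapse behaviorally distinct models together (lower bound) nor exaggerate small behavioral differences (upper bound), which is what makes them usable for lineage comparison.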
Problem

Research questions and friction points this paper is trying to address.

Tracing undocumented evolutionary relationships among millions of LLMs
Overcoming limitations of task-specific methods and rigid assumptions
Establishing functional DNA representations to reconstruct LLM evolutionary trees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defining LLM DNA as functional behavior representation
Developing training-free pipeline for DNA extraction
Constructing evolutionary tree using phylogenetic algorithms
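The extraction-plus-phylogeny pipeline above can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual method: toy per-probe output distributions stand in for real model fingerprints, total-variation distance stands in for the DNA metric, and a minimal UPGMA clustering stands in for the phylogenetic algorithm.

```python
# Hypothetical sketch of the LLM-DNA idea: represent each model by its
# behavioral fingerprint on a fixed probe set, compare fingerprints with
# a distance, and cluster the distance matrix into a phylogeny-like tree.

# Toy "fingerprints": per-probe output distributions over a 3-token vocabulary.
# A fine-tuned child is a small perturbation of its parent, so distances
# should recover the lineage (base with base-ft) rather than cross-family links.
FINGERPRINTS = {
    "family-A/base":    [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],
    "family-A/base-ft": [[0.65, 0.25, 0.10], [0.55, 0.35, 0.10]],
    "family-B/base":    [[0.10, 0.20, 0.70], [0.20, 0.10, 0.70]],
}

def dna_distance(fp_a, fp_b):
    """Mean total-variation distance across probes (a stand-in metric)."""
    per_probe = [
        0.5 * sum(abs(p - q) for p, q in zip(da, db))
        for da, db in zip(fp_a, fp_b)
    ]
    return sum(per_probe) / len(per_probe)

def upgma(names, dist):
    """Minimal UPGMA agglomeration; returns the merged tree as a string."""
    sizes = {n: 1 for n in names}
    d = {frozenset((a, b)): dist(a, b) for a in names for b in names if a < b}
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        for c in sizes:                    # average-linkage distance update
            d[frozenset((merged, c))] = (
                na * d.pop(frozenset((a, c))) + nb * d.pop(frozenset((b, c)))
            ) / (na + nb)
        d.pop(pair)
        sizes[merged] = na + nb
    return next(iter(sizes))

names = list(FINGERPRINTS)
tree_name = upgma(names, lambda x, y: dna_distance(FINGERPRINTS[x], FINGERPRINTS[y]))
print(tree_name)  # → ((family-A/base,family-A/base-ft),family-B/base)
```

The two family-A models merge first because their fingerprint distance (0.05) is far below the cross-family distance (0.60), mirroring how the paper's DNA comparisons are meant to recover fine-tuning lineages before family-level splits.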
Authors
Zhaomin Wu · Research Fellow at NUS · Trustworthy AI, Federated Learning, Machine Unlearning
Haodong Zhao · Shanghai Jiao Tong University · Federated Learning, LLM
Ziyang Wang · Department of Computer Science, National University of Singapore
Jizhou Guo · Shanghai Jiao Tong University · Large Language Models, Foundation Models, Natural Language Processing
Qian Wang · Department of Computer Science, National University of Singapore
Bingsheng He · Department of Computer Science, National University of Singapore