🤖 AI Summary
The rapid proliferation of large language models (LLMs) has outpaced systematic tracking of their evolutionary relationships (e.g., via fine-tuning, distillation, or adaptation), resulting in unrecorded, non-reproducible lineages. Existing analysis methods suffer from task specificity, reliance on fixed model sets, or assumptions about architecture or tokenization. Method: We propose LLM DNA, a low-dimensional, bi-Lipschitz representation of functional behavior that defines the first heritable, genetically deterministic behavioral fingerprint for LLMs, agnostic to architecture and tokenizer. Our training-free DNA extraction pipeline, combined with phylogenetic inference, constructs the first comprehensive LLM phylogeny, spanning 305 models. Contribution/Results: Experiments demonstrate that LLM DNA accurately recovers known lineage relationships, discovers previously undocumented evolutionary paths, and quantitatively captures architectural shifts, temporal trends, and divergent family-level evolutionary rates.
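The summary does not spell out how a training-free behavioral fingerprint might be extracted, so here is a minimal, purely illustrative sketch: a model's "behavior" is taken to be its softmax outputs on a fixed probe set, and the "DNA" is a shared random linear projection of that behavior (a bi-Lipschitz map on a bounded set, in the spirit of Johnson-Lindenstrauss). The toy models, probe construction, and function names (`behavior`, `extract_dna`) are all assumptions, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def behavior(weights, probes):
    """Concatenate a toy linear model's output distributions on fixed probes."""
    return np.concatenate([softmax(p @ weights) for p in probes])

def extract_dna(behaviors, dim=8):
    """Project behavior vectors through one shared random map, so pairwise
    distances between models are approximately preserved in low dimension."""
    proj = rng.normal(size=(behaviors.shape[1], dim)) / np.sqrt(dim)
    return behaviors @ proj

probes = [rng.normal(size=16) for _ in range(10)]
base = rng.normal(size=(16, 50))                      # a toy "base model"
finetune = base + 0.05 * rng.normal(size=base.shape)  # lightly perturbed child
unrelated = rng.normal(size=(16, 50))                 # an independent model

B = np.stack([behavior(w, probes) for w in (base, finetune, unrelated)])
dna = extract_dna(B)
d_child = np.linalg.norm(dna[0] - dna[1])  # base vs. its fine-tune: small
d_other = np.linalg.norm(dna[0] - dna[2])  # base vs. unrelated model: large
```

The key property this toy preserves is the one the summary claims: a fine-tuned descendant stays measurably closer to its parent in DNA space than an unrelated model does, and the extraction requires no training, only forward passes on shared probes.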
📝 Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of such DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms; the resulting tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
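The abstract mentions "phylogenetic algorithms" without naming one, so the sketch below uses average-linkage (UPGMA-style) agglomeration over pairwise DNA distances as a stand-in. The DNA vectors are fabricated to encode a known toy lineage (ancestor → child-a → grandchild, plus a sibling child-b); nothing here reflects the paper's actual data or algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def upgma_merges(X, names):
    """Greedily merge the closest pair of clusters (average linkage) and
    return the merge order -- a crude proxy for a phylogenetic tree."""
    clusters = {n: [i] for i, n in enumerate(names)}
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                d = D[np.ix_(clusters[keys[a]], clusters[keys[b]])].mean()
                if d < best:
                    best, pair = d, (keys[a], keys[b])
        merges.append((*pair, best))
        clusters["+".join(pair)] = clusters.pop(pair[0]) + clusters.pop(pair[1])
    return merges

# Fabricated DNA vectors encoding a known lineage for illustration.
ancestor = rng.normal(size=8)
child_a = ancestor + 0.1 * rng.normal(size=8)
child_b = ancestor + 0.1 * rng.normal(size=8)
grandchild = child_a + 0.02 * rng.normal(size=8)

names = ["ancestor", "child-a", "child-b", "grandchild"]
X = np.stack([ancestor, child_a, child_b, grandchild])
merges = upgma_merges(X, names)  # earliest merges join the closest relatives
```

Because the perturbation from child-a to grandchild is the smallest, the first merge joins that pair, mirroring how a tree built from DNA distances can recover the closest lineage relationships first.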