HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets

📅 2025-09-10
🤖 AI Summary
In single-cell clustering analysis, resolution selection and Gene Ontology (GO) functional annotation rely heavily on subjective expertise and lack objective, quantitative criteria. To address this, we propose HypoGeneAgent, an LLM-driven agent framework that recasts functional annotation as an optimization problem built on hypothesis generation and validation: a large language model generates ranked candidate GO-term hypotheses with confidence scores, and sentence embeddings quantify intra-cluster agreement and inter-cluster separation, which are combined into a single resolution score. The score replaces heuristic resolution tuning with an automated, reproducible selection criterion. In a preliminary test on public K562 CRISPRi Perturb-seq data, the score selects clustering granularities that align better with known biological pathways than those favored by conventional metrics such as the silhouette score and modularity. The framework offers an interpretable, reproducible approach to the functional interpretation of single-cell data.

📝 Abstract
Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework that transforms cluster annotation into a quantitatively optimizable task. First, an LLM acting as a gene-set analyst examines the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Next, we embed every predicted description with a sentence-embedding model, compute pairwise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, that is, high average similarity within the same cluster, termed intra-cluster agreement, and (ii) their external distinctiveness, that is, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters are simultaneously coherent and mutually exclusive. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that align better with known pathways than those chosen by classical metrics, such as the silhouette score and modularity score, for summarizing gene functional enrichment. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies.
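The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how intra-cluster agreement and inter-cluster separation are combined, so the simple difference used here is an assumption, as are the function names and the toy two-dimensional vectors that stand in for real sentence embeddings of the predicted GO descriptions.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities between the rows of `embeddings`."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def resolution_score(embeddings, labels):
    """Mean intra-cluster similarity minus mean inter-cluster similarity.

    ASSUMPTION: the source only says the two quantities are "combined";
    taking their difference is one illustrative choice, not the paper's formula.
    """
    sim = cosine_similarity_matrix(embeddings)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    intra_mask = same & off_diagonal  # distinct descriptions, same cluster
    inter_mask = ~same                # descriptions from different clusters
    intra = sim[intra_mask].mean() if intra_mask.any() else 0.0
    inter = sim[inter_mask].mean() if inter_mask.any() else 0.0
    return float(intra - inter)


# Toy stand-ins for embeddings of four predicted descriptions (hypothetical data).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Two hypothetical clusterings of the same descriptions.
candidates = {
    "coherent": [0, 0, 1, 1],  # similar descriptions share a cluster
    "mixed": [0, 1, 0, 1],     # similar descriptions are split apart
}

scores = {name: resolution_score(toy_embeddings, labels)
          for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In a real run each candidate resolution would yield its own clusters and therefore its own set of LLM-generated descriptions; the toy example reuses one set of vectors only to keep the comparison easy to follow.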
Problem

Research questions and friction points this paper is trying to address.

Automates subjective gene-set cluster resolution selection
Quantifies functional annotation using LLM-generated hypotheses
Optimizes clustering granularity via intra-cluster and inter-cluster metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven framework for cluster annotation
Generates ranked GO hypotheses with confidence scores
Agent-derived resolution score maximizes coherence and separation
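As a rough illustration of what a ranked list of GO hypotheses with confidence scores might look like in code, the sketch below uses a small record type and a ranking helper. The source does not specify the agent's output schema, so the `GOHypothesis` fields, the `rank_hypotheses` helper, and the example terms and confidence values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOHypothesis:
    """One candidate annotation for a gene set (hypothetical schema)."""
    go_term: str       # GO term name proposed by the LLM
    description: str   # free-text description that would later be embedded
    confidence: float  # assumed to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def rank_hypotheses(hypotheses, top_k=None):
    """Return hypotheses sorted by descending confidence, optionally truncated."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up example values, for illustration only.
raw = [
    GOHypothesis("translation", "protein synthesis at the ribosome", 0.55),
    GOHypothesis("ribosome biogenesis", "assembly of ribosomal subunits", 0.80),
    GOHypothesis("mRNA splicing", "removal of introns from pre-mRNA", 0.20),
]

ranked = rank_hypotheses(raw)
for h in ranked:
    print(f"{h.confidence:.2f}  {h.go_term}")
```

Keeping the free-text description next to the term name reflects the abstract's statement that predicted descriptions are what get embedded and compared.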
Authors
Ying Yuan
Carnegie Mellon University
Robot learning
Xing-Yue Monica Ge
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA
Aaron Archer Waterman
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA
Tommaso Biancalani
Genentech
machine learning, computational biology, drug discovery
David Richmond
AI and Machine Learning Scientist
computer vision for biomedical images
Yogesh Pandit
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA
Avtar Singh
Department of Cell and Tissue Genomics, Genentech, South San Francisco, CA, USA
Russell Littman
Genentech Research & Early Development (gRED), Genentech, South San Francisco, CA, USA
Jin Liu
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA
Jan-Christian Huetter
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA
Vladimir Ermakov
Computational Sciences-Center of Excellence, Genentech, South San Francisco, CA, USA