About the job
We are looking for highly motivated interns to join our compute team as machine learning scientists in our Summer 2026 cohort, working at the intersection of machine learning and life sciences. You will partner directly with a team mentor to develop and/or apply ML methods to process and analyze large-scale datasets from multiple modalities over the course of the summer (11-12 weeks). These internships can be based at our South San Francisco headquarters with a hybrid work schedule, or can be remote, depending on the team mentor's location and business need.
Responsibilities
- Leverage publicly available single-cell transcriptomics resources to extract insights about disease mechanisms relevant to our therapeutic areas;
- Develop, productionize, and deploy cutting-edge ML approaches to integrate large-scale multi-modal phenotypic datasets;
- Develop workflows to enable post-GWAS (Genome-Wide Association Study) analysis of results, e.g., fine-mapping;
- Conduct translational genetics deep dives, enabling higher-throughput annotation and exploration of candidate genes from our discovery efforts;
- Design statistical methods to improve rare variant burden tests and to increase power for longitudinal phenotypes;
- Develop ML models for imputing disease-relevant phenotypes from high-content clinical imaging datasets, e.g., MRI, PET-CT;
- Develop ML methods for disentangling and genetically interpreting axes of variation in complex phenotypes;
- Use LLMs to extract disease-relevant information from medical records;
- Explore generative models of small molecules, biologics, and/or oligonucleotide therapeutics in various data modalities such as 2D and 3D representations for hit-to-lead drug discovery efforts;
- Develop new geometric deep learning methods to better characterize nuanced molecular properties and relationships;
- Identify and prototype novel microscopy-driven phenotyping workflows, including hardware acquisition, post-processing, and featurization;
- Develop robust software tooling to support the deployment of new and existing methods for general use by insitro scientists;
- Optimize existing microscopy acquisition methods in both hardware and software, using ML feature outputs to benchmark improvements.
Qualifications
Minimum
- Working towards a BS, MS, or Ph.D. in engineering, computational biology, systems biology, computer science, mathematics, statistics, life science, chemistry, physics, or a related field;
- Proficiency in one or more general-purpose programming languages (we primarily use Python);
- Interest in using and developing new statistical and machine learning methods inspired by real problems;
- Curiosity about human physiology or disease biology;
- Commitment to writing high-quality, well-commented code and documentation;
- Ability to communicate effectively and collaborate with people of diverse backgrounds and job functions;
- Passion for making a difference in the world.
Preferred
- First-hand experience with biological data, preferably using computational approaches;
- Passion for learning how to work with diverse functional genomic assays (RNA/DNase/ATAC/ChIP-seq, etc.);
- Interest in learning how to analyze single-cell RNA-seq data;
- Solid understanding of computational chemistry, such as virtual screening (classic QSAR modeling, structure-based drug discovery) and library design;
- Demonstrated ability to use and develop cutting-edge statistical and machine learning methods inspired by real problems;
- Experience with machine learning and deep learning frameworks (e.g., scikit-learn, PyTorch);
- Demonstrated ability to write high-quality, production-ready code (readable, well-tested, with well-designed APIs);
- Experience with Linux environments, database languages (e.g., SQL, NoSQL), and version control practices and tools such as Git;
- Publications of high-quality work in relevant computational biology, bioinformatics, systems biology, life sciences, or biomedical venues, including journals and conferences;
- Passion for solving problems, asking questions, and learning independently;
- Familiarity with the SciPy/PyData ecosystem (NumPy, pandas, SciPy, Dask, etc.);
- Familiarity with cloud computing services (AWS or GCP);
- Familiarity with statistical analysis software, e.g., R.