Research Scientist, Interpretability

About the job

The Interpretability team at Anthropic is working to reverse-engineer how trained models work, because we believe that a mechanistic understanding is the most robust way to make advanced systems safe. We're focused on mechanistic interpretability, which aims to discover how neural network parameters map to meaningful algorithms. We're looking for researchers and engineers to join our efforts.

Responsibilities

Develop methods for understanding LLMs by reverse-engineering the algorithms learned in their weights

Design and run robust experiments, both quickly in toy scenarios and at scale in large models

Create and analyze new interpretability features and circuits to better understand how models work

Build infrastructure for running experiments and visualizing results

Work with colleagues to communicate results internally and publicly

Qualifications

Preferred

You may be a good fit if you:

Have a strong track record of scientific research (in any field), and have done some work on interpretability

Enjoy team science – working collaboratively to make big discoveries

Are comfortable with messy experimental science. We're inventing the field as we work, and the first textbook is years away

View research and engineering as two sides of the same coin. Every team member writes code, designs and runs experiments, and interprets results

Can clearly articulate and discuss the motivations behind your work, and teach us about what you've learned. You like writing up and communicating your results, even when they're null