Scholar

Adrià Garriga-Alonso

Google Scholar ID: OtnThiMAAAAJ

Research Scientist, FAR AI

AI safety, interpretability

Citations & Impact

Citations (all-time): 3,482

Co-authors: list available

Contact

Email: adria.garriga@gmail.com

Publications

12 items

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

2026

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

2026

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

2026

DiFR: Inference Verification Despite Nondeterminism

2025

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

2025

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

2025

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

2025

Interpreting Emergent Planning in Model-Free Reinforcement Learning

2025

Resume

Academic Achievements

Publications include 'Towards Automatic Circuit Discovery for Mechanistic Interpretability' and 'Causal Scrubbing: a Method for Rigorously Testing Interpretability Hypotheses'.

Research Experience

Currently a Research Scientist at FAR AI. Previously worked at Redwood Research on interpretability research and software development.

Education

Holds a PhD in machine learning from the University of Cambridge, advised by Prof. Carl Rasmussen. Doctoral research focused on improving uncertainty quantification in neural networks using Bayesian principles.

Background

Research interests include how neural networks work internally, evaluating the accuracy of interpretability explanations, finding algorithmic explanations at lower labor and compute costs, and understanding the behavior and motivations of agent-like AIs. The goal is to ensure that AI is beneficial to society.
