Published multiple papers at top conferences such as NeurIPS, ICLR, and ICML. Selected works include:
- ABBEL: Acting via Belief Bottlenecks Expressed in Language
- Among Us: A Sandbox for Measuring and Detecting Agentic Deception
- A is for Absorption: Studying Feature Splitting and Absorption in SAEs
- Auditing Language Models for Hidden Objectives
- Who’s the Evil Twin? Differential Auditing for Undesired Behavior
- Intricacies of Feature Geometry in Large Language Models
- Modular Training of Neural Networks aids Interpretability
- Some Lessons from the OpenAI-FrontierMath Debacle
- Progress Measures for Grokking on Real-world Tasks
- Challenges in Mechanistically Interpreting Harmful Representations
- NICE: To Optimize In-Context Examples or Not?
- CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients
- Predicting Treatment Adherence of Tuberculosis Patients at Scale
Research Experience
Currently a Research Scientist at the AI Security Institute (AISI). Previously worked as an independent researcher, at Microsoft Research, and as an Associate Research Scientist at Wadhwani AI, working on AI for Social Good and Healthcare.
Education
Conducted research at UC Berkeley's Center for Human-Compatible AI (CHAI) and participated in the ML Alignment & Theory Scholars (MATS) program, working with Adrià Garriga-Alonso and Nandi Schoots. Also completed Neel Nanda's MATS training program.
Background
Research interests include frontier alignment, interpretability, and reinforcement learning, with a focus on ensuring that advanced AI (AGI) is safe and beneficial.
Miscellany
Enjoys writing fiction and poetry, believing that poetry offers a channel into emotions that could not be expressed any other way.