Satvik Golechha

Google Scholar ID: N-j6EO8AAAAJ
Research Scientist, AISI
AGI security, alignment, interpretability, reinforcement learning
Citations & Impact
  • Citations (all-time): 110
  • H-index: 6
  • i10-index: 3
  • Publications: 15
  • Co-authors: 10
Resume
Academic Achievements
  • Published multiple papers in top conferences such as NeurIPS, ICLR, and ICML. Specific works include:
    - ABBEL: Acting via Belief Bottlenecks Expressed in Language
    - Among Us: A Sandbox for Measuring and Detecting Agentic Deception
    - A is for Absorption: Studying Feature Splitting and Absorption in SAEs
    - Auditing Language Models for Hidden Objectives
    - Who’s the Evil Twin? Differential Auditing for Undesired Behavior
    - Intricacies of Feature Geometry in Large Language Models
    - Modular Training of Neural Networks aids Interpretability
    - Some Lessons from the OpenAI-FrontierMath Debacle
    - Progress Measures for Grokking on Real-world Tasks
    - Challenges in Mechanistically Interpreting Harmful Representations
    - NICE: To Optimize In-Context Examples or Not?
    - CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients
    - Predicting Treatment Adherence of Tuberculosis Patients at Scale
Research Experience
  • Currently a Research Scientist at the AI Security Institute (AISI). Previously worked as an independent researcher and at Microsoft Research, and was an Associate Research Scientist at Wadhwani AI, working on AI for Social Good and Healthcare.
Education
  • Conducted research at UC Berkeley's Center for Human-Compatible AI (CHAI) and participated in the ML Alignment & Theory Scholars (MATS) program, working with Adrià Garriga-Alonso and Nandi Schoots. Also completed Neel Nanda's MATS training program.
Background
  • Research interests include frontier alignment, interpretability, and reinforcement learning, focusing on ensuring advanced AGI is safe and beneficial.
Miscellany
  • Enjoys writing fiction and poetry, believing that poetry offers a channel into emotions that could not be expressed any other way.