Published multiple papers at top conferences such as NeurIPS, ICLR, and ICML. Selected works include:
- ABBEL: Acting via Belief Bottlenecks Expressed in Language
- Among Us: A Sandbox for Measuring and Detecting Agentic Deception
- A is for Absorption: Studying Feature Splitting and Absorption in SAEs
- Auditing Language Models for Hidden Objectives
- Who’s the Evil Twin? Differential Auditing for Undesired Behavior
- Intricacies of Feature Geometry in Large Language Models
- Modular Training of Neural Networks aids Interpretability
- Some Lessons from the OpenAI-FrontierMath Debacle
- Progress Measures for Grokking on Real-world Tasks
- Challenges in Mechanistically Interpreting Harmful Representations
- NICE: To Optimize In-Context Examples or Not?
- CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients
- Predicting Treatment Adherence of Tuberculosis Patients at Scale
Research Experience
Currently a Research Scientist at the AI Security Institute (AISI). Previously worked as an independent researcher, at Microsoft Research, and as an Associate Research Scientist at Wadhwani AI, working on AI for Social Good and Healthcare.
Education
Conducted research at UC Berkeley's Center for Human-Compatible AI (CHAI) and participated in the ML Alignment & Theory Scholars (MATS) program, working with Adrià Garriga-Alonso and Nandi Schoots. Also completed Neel Nanda's MATS training program.
Background
Research interests include frontier alignment, interpretability, and reinforcement learning, with a focus on ensuring that advanced AI (AGI) is safe and beneficial.
Miscellany
Enjoys writing fiction and poetry, believing that poetry offers a channel into emotions that could not be expressed any other way.