Publications: 'Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols', 'Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM', 'Capability-Based Scaling Laws for LLM Red-Teaming', 'An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks'. Conference presentations: Presented at international conferences including ICML 2025, ICLR 2025, and NeurIPS 2024.
Research Experience
Research projects: Focus on jailbreaking attacks on LLMs, including threat models for jailbreaks and the effectiveness of attacks against safety-trained models.
Education
Degree: PhD; Institution: ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems; Start date: May 1, 2024.
Background
Research interests: adversarial robustness, AI safety, and ML security. Bio: I am a second-year ELLIS / IMPRS-IS PhD student, advised by Jonas Geiping and Maksym Andriushchenko.
Miscellany
Personal interests: Enjoys finding ways to break machine learning systems.