Humanity's Last Exam (HLE): Developed an extremely challenging LLM benchmark of 3,000 expert-designed questions spanning mathematics, philosophy, and the sciences, highlighting a significant gap between AI systems and human experts (arXiv Preprint).
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet: Demonstrated that current LLM defenses fail against multi-turn human adversarial attacks; accepted as an Oral presentation at the NeurIPS 2024 Red Teaming Workshop.
The WMDP Benchmark: Introduced a benchmark measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security, and proposed RMU—an unlearning method that reduces harmful capabilities while preserving general performance (ICML 2024).
Virology Capabilities Test (VCT): Co-developed a multimodal virology Q&A benchmark assessing LLMs’ ability to troubleshoot lab protocols (arXiv Preprint).
HarmBench: Contributed to a standardized evaluation framework for automated red teaming, benchmarking 18 attack methods against 33 target LLMs and defenses (ICML 2024).
Representation Engineering: Participated in developing a top-down approach to AI transparency (arXiv Preprint).