Tomek Korbak

Google Scholar ID: YQ5rrk4AAAAJ

UK AI Security Institute

language models, AI safety, reinforcement learning, chain of thought monitoring, LLM agents

Citations & Impact

All-time

Citations

4,090

H-index

22

i10-index

29

Publications

20

Co-authors

8

Contact

Email: tomasz.korbak@gmail.com

Publications

8 items

Reasoning Models Struggle to Control their Chains of Thought

2026

Cited

0

Training Agents to Self-Report Misbehavior

2026

Cited

0

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

2025

Cited

0

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

2025

Cited

0

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

2025

Cited

0

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

2025

Cited

0

Fundamental Limitations in Defending LLM Finetuning APIs

2025

Cited

0

A sketch of an AI control safety case

2025

Cited

0

Resume (English only)

Academic Achievements

Published multiple papers at top-tier conferences including ICLR, ICML, NeurIPS, and COLM, such as:
“A sketch of an AI control safety case”
“Looking Inward: Language Models Can Learn About Themselves by Introspection” (ICLR 2025)
“Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data” (COLM 2024)
“Pretraining Language Models with Human Preferences” (ICML 2023, oral)
“On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting” (NeurIPS 2022, oral)
Many of these papers are accompanied by open-source code.

Co-authors

8 total

Affiliate, CHAI, UC Berkeley

Samuel R. Bowman

Anthropic and NYU

New York University, Genentech

Associate Professor, University of Toronto

Sussex University