“The Alignment Waltz: Jointly Training Agents to Collaborate for Safety” (arXiv preprint): Introduced WaltzRL, a multi-agent RL framework that improves LLM safety and reduces overrefusals through collaborative agent training
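A minimal sketch of the two-agent collaboration loop at inference time, assuming a generic `generate` LLM call; the prompts, stopping rule, and stub are illustrative, not the paper's implementation (which jointly trains both agents with RL):

```python
# Sketch of a WaltzRL-style loop: a conversation agent drafts a reply,
# a feedback agent flags unsafe content or unnecessary refusals, and the
# draft is revised until the reviewer has no objection.

def generate(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real model or API client."""
    return "OK"  # canned reply so the sketch runs end to end

def collaborate(user_query: str, max_rounds: int = 2) -> str:
    draft = generate("You are a helpful assistant.", user_query)
    for _ in range(max_rounds):
        feedback = generate(
            "You are a safety reviewer. If the reply is unsafe or refuses "
            "a benign request, explain how to fix it; otherwise say OK.",
            f"Query: {user_query}\nReply: {draft}",
        )
        if feedback.strip() == "OK":
            break  # reviewer is satisfied: safe and not an overrefusal
        draft = generate(
            "Revise your reply according to the reviewer's feedback.",
            f"Query: {user_query}\nReply: {draft}\nFeedback: {feedback}",
        )
    return draft

print(collaborate("How do I disable a smoke detector while painting a ceiling?"))
```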
“Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements” (ICLR 2025): Proposed a framework for adapting LLMs to diverse safety requirements at inference time without retraining
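A minimal sketch of the inference-time control idea: a natural-language safety config is composed into the system prompt, so switching safety profiles requires no retraining. The config texts and the `build_prompt` helper below are illustrative assumptions, not the paper's exact configs:

```python
# Each entry is a natural-language safety policy; one model serves all
# profiles, selected per request at inference time.
SAFETY_CONFIGS = {
    "strict": "Refuse any request touching on violence, self-harm, or illegality.",
    "game_studio": "Fictional violence is acceptable for game writing; "
                   "refuse real-world harmful instructions.",
}

def build_prompt(config_name: str,
                 base_system: str = "You are a helpful assistant.") -> str:
    """Compose a system prompt for the requested safety profile."""
    return f"{base_system}\nSafety policy: {SAFETY_CONFIGS[config_name]}"

print(build_prompt("game_studio"))
```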
“Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data” (NAACL 2025 oral): Developed models that quote verbatim from trusted pre-training sources to enable easy verification
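A minimal sketch of why verbatim quoting makes verification easy: checking an output reduces to membership tests against the trusted corpus. The whitespace n-gram scan below is an illustrative simplification (a real system would index the corpus, e.g., with a suffix array):

```python
def quoted_spans(output: str, corpus: str, min_words: int = 5):
    """Yield word n-grams of the output that appear verbatim in the corpus."""
    words = output.split()
    for i in range(len(words) - min_words + 1):
        span = " ".join(words[i : i + min_words])
        if span in corpus:  # verbatim quotes are cheap to confirm
            yield span

corpus = "The mitochondrion is the powerhouse of the cell and produces ATP."
output = "As the source states, the powerhouse of the cell and produces ATP."
print(list(quoted_spans(output, corpus)))
```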
“SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation” (NAACL 2024): Proposed SemStamp, a sentence-level semantic watermarking method that uses locality-sensitive hashing (LSH) to remain robust under paraphrasing
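A minimal sketch of the LSH mechanics, assuming a toy deterministic `embed` stand-in for the paper's sentence encoder: keyed random hyperplanes hash each sentence embedding to a signature, generation accepts only sentences whose bucket is watermark-valid, and paraphrases tend to land in the same bucket because their embeddings are close:

```python
import hashlib
import numpy as np

DIM, N_PLANES, SEED = 64, 8, 42
rng = np.random.default_rng(SEED)              # shared secret key
planes = rng.standard_normal((N_PLANES, DIM))  # random hyperplanes for LSH

def embed(sentence: str) -> np.ndarray:
    """Toy deterministic 'embedding'; replace with a real sentence encoder
    so that paraphrases map to nearby vectors."""
    h = hashlib.sha256(sentence.lower().encode()).digest()
    vec_rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    return vec_rng.standard_normal(DIM)

def lsh_signature(sentence: str) -> int:
    """Sign of each hyperplane projection gives one signature bit; nearby
    embeddings (e.g., paraphrases) tend to share a signature."""
    bits = (planes @ embed(sentence)) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def is_valid(sentence: str) -> bool:
    """Accept sentences whose LSH bucket falls in the 'valid' half of the
    keyed partition (here: even signatures)."""
    return lsh_signature(sentence) % 2 == 0

# Generation rejection-samples candidate sentences until one is valid;
# detection counts the fraction of valid sentences and thresholds it.
print(is_valid("The cat sat on the mat."))
```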