['"Does quantization affect models' performance on long-context tasks?", EMNLP 2025', '"BEARCUBS: A benchmark for computer-using web agents", COLM 2025', '"Enhancing Human Evaluation in Machine Translation with Comparative Judgment", ACL 2025', '"Localizing and Mitigating Errors in Long-form Question Answering", Findings of ACL 2025', '"VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation", Findings of EMNLP 2024', '"GEE! Grammar Error Explanation with Large Language Models", Findings of NAACL 2024']