Published multiple papers at venues including EMNLP, ICML, and ACL, as well as preprints on arXiv, among them 'Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains' and 'Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math'.
Research Experience
Research Scientist at Salesforce AI Research. During his graduate studies, he interned at Duolingo and Amazon.
Education
PhD from Georgia Tech, where he was advised by Mark Davenport and collaborated closely with Ashwin Pananjady. BSE from the University of Michigan. During graduate school, he interned at Duolingo, working with Will Monroe, and at Amazon, working with Arjun Seshadri, Mariya Vasileva, and Achal Dave.
Background
Research Scientist at Salesforce AI Research. His current research focuses on improving the reasoning ability of foundation models, particularly large language models, with an emphasis on automatic evaluation (generative reward models / LLM-as-a-judge). He works largely on post-training and evaluation. More broadly, he is interested in the role humans play in the era of large models: when are human responses necessary, and when can we avoid collecting human feedback?
Miscellany
In his free time, he enjoys cooking (and eating), reading, running, and watching basketball (NBA and college).