International Symposium on High-Performance Computer Architecture · 2024
Resume (English only)
Academic Achievements
Selected publications include:
- Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market (SOSP'25)
- SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling (SIGCOMM'25)
- Towards LLM-Based Failure Localization in Production-Scale Networks (SIGCOMM'25)
- New Evolution of Hoyan: Enhancing Scalability, Usability, and Accuracy for Alibaba's Global WAN Verification (SIGCOMM'25)
- Alibaba Stellar: A New Generation RDMA Network for Cloud AI (SIGCOMM'25)
- SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures (SIGCOMM'25)
Research Experience
Prior to joining Alibaba, he was a research scientist and lecturer in the Computer Science Department at Yale University until June 2018. During that time, he worked with Ruzica Piskac, Mahesh Balakrishnan, and Avi Silberschatz on building cloud failure auditing systems, and with Joan Feigenbaum on tracking-resistant anonymous systems. He was also an instructor for the Building Distributed Systems course.
Education
He received his Ph.D. from Yale University in 2015, under the guidance of Bryan Ford. His dissertation work focused on building the first cloud-reliability auditing system, Independence-as-a-Service (INDaaS), which proactively detects deep, unexpected dependencies that could cause cloud-scale correlated failures. This work was published at OSDI'14.
Background
He is currently a Director of Network Research at Alibaba Cloud. His research focuses on building high-performance and reliable network systems for AI and cloud, with a particular emphasis on networking for AI and AI for networking.