VLDB'22: Harmony – enables training massive DNN models on commodity servers by overcoming GPU memory limits.
MLSys'22: BNS-GCN – efficient full-graph GCN training via partition-parallelism and random boundary sampling (co-first author).
ICLR'22: PipeGCN – pipelined feature communication for efficient full-graph GCN training.
MobiCom'21: Visage – enables timely analytics for drone imagery (co-first author; code deployed in Microsoft’s FarmBeats).
HotOS'21: Position paper advocating training large DNNs on commodity servers for broader accessibility.
MICRO'19: DeepStore – in-storage acceleration for intelligent queries.
ISCA'19: iSwitch – in-switch computing to accelerate distributed reinforcement learning.
NeurIPS'18: Pipe-SGD – decentralized pipelined SGD framework for distributed deep net training.
Background
Currently a Research Scientist at ByteDance, where he founded and built the veScale project from scratch; it now supports 99% of internal training jobs.
Research focuses on Distributed Machine Learning Systems, spanning Efficient ML, System Optimization, and Hardware Acceleration.
Pioneered pipelined data parallelism (NeurIPS'18) and in-network acceleration for distributed training, using SmartNICs (MICRO'18) and programmable switches (ISCA'19).
Recent work includes massive model training systems (VLDB'22) and massive graph training systems (MLSys'22).
Independently established his research direction in distributed ML systems during his Ph.D. at UIUC, at a time when few students there worked in the area.