Publications
His papers have been accepted at top conferences, including NeurIPS, DAC, and MICRO, with a best paper nomination at ASP-DAC 2024. Selected papers include:
- DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
- Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference
- PacQ: A SIMT Microarchitecture for Efficient Dataflow in Hyper-asymmetric GEMMs
- LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks
- Workload-Balanced Pruning for Sparse Spiking Neural Networks
- Wearable-based Human Activity Recognition with Spatio-Temporal Spiking Neural Networks
- SATA: Sparsity-Aware Training Accelerator for Spiking Neural Networks
Research Experience
He has interned at Microsoft Azure on the AI System Architecture team and at Cerebras Systems on the ASIC team.
Education
He is currently a final-year Ph.D. student in the Department of Electrical Engineering at Yale University, advised by Prof. Priyadarshini Panda. Prior to joining Yale, he earned his B.S. from the University of Wisconsin-Madison, with majors in Electrical Engineering, Computer Science, and Mathematics. As an undergraduate, he worked with Prof. Joshua San Miguel on designing computer architectures for stochastic computing.
Background
His research focuses on designing energy-efficient computer architectures, systems, and algorithms for AI workloads, particularly those with asymmetric operand precision or sparsity. He is also interested in neuromorphic computing, particularly spiking neural networks, as an enabler of bio-plausible and energy-efficient deep learning.