- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM, which achieves a near-lossless compression ratio, roughly 2x inference speedup, and 2x peak-memory savings at inference time.
- KV Cache Optimizations for Large Language Model Inference (under review at MLSys 2024)
- Towards Sustainable Learning: Coresets for Data-efficient Deep Learning (ICML 2023), which presents a dataset distillation algorithm based on submodular functions and batch SGD.
Additionally, developed THOP (PyTorch-OpCounter), a third-party Python library that counts the FLOPs of PyTorch models.
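The per-layer counting rules behind such a FLOP counter can be sketched in plain Python. This is a simplified illustration of the idea, not THOP's actual implementation (THOP attaches hooks to PyTorch modules); the layer shapes below are arbitrary examples:

```python
# Simplified sketch of per-layer FLOP counting rules, in the spirit of
# an op counter like THOP (illustrative only, not THOP's code).

def linear_flops(batch, in_features, out_features):
    # A Linear layer performs in_features multiply-adds per output unit;
    # here one multiply-add is counted as 2 FLOPs.
    return 2 * batch * in_features * out_features

def conv2d_flops(batch, c_in, c_out, k, h_out, w_out):
    # Each output pixel of each output channel needs c_in * k * k
    # multiply-adds.
    return 2 * batch * c_out * h_out * w_out * c_in * k * k

if __name__ == "__main__":
    # e.g. a 512 -> 1024 linear layer on a single sample:
    print(linear_flops(1, 512, 1024))  # 1048576
```

A real counter sums such per-module rules over a forward pass to report a model-level total.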
Research Experience
Involved in several research projects, including:
- torchanalyse, a model profiling tool based on TVM and Maestro.
- Epipe, a research project that uses compression algorithms to reduce activation-transfer bandwidth during cloud-based training.
Education
Received a B.Eng. in Computer Science from Zhejiang University in 2023. Currently a Ph.D. student at the Georgia Institute of Technology, advised by Prof. Tushar Krishna. Previously worked with Prof. Baharan Mirzasoleiman at UCLA on efficient machine learning from massive datasets, and collaborated with Prof. Song Han at MIT on efficient machine learning on edge devices.
Background
Interested in efficient machine learning and systems, with experience at the intersection of both fields. Aims to use low-rank approximation and compression algorithms to accelerate machine learning models, especially LLMs. Also designs efficient systems, such as inference and fine-tuning schedulers, to speed up training and inference.
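The low-rank-approximation idea mentioned above can be illustrated with a minimal NumPy sketch: replace a matrix with its truncated SVD, trading a small approximation error for less storage and compute. The matrix size and rank here are arbitrary examples, not tied to any specific model:

```python
import numpy as np

# Minimal sketch: approximate a matrix with a rank-r truncated SVD.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))       # stand-in for a weight or cache matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = 8
A_r = (U[:, :r] * S[:r]) @ Vt[:r]       # best rank-r approximation of A

# Storing U[:, :r], S[:r], Vt[:r] needs ~(2*64*r + r) numbers instead of
# 64*64, and a matrix-vector product costs O(64*r) instead of O(64^2).
rel_err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
```

The same trade-off (factorized storage plus a bounded residual) underlies many compression recipes for weights and KV caches.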