Published several papers at top-tier conferences, including NeurIPS, ICML, ICLR, and EMNLP. Key contributions include: a lossless compression technique that reduces model size by 30% while preserving bit-for-bit identical outputs and enabling efficient GPU inference; fine-tunable sketches for efficient LLM adaptation; LeanQuant, an accurate and scalable LLM quantization method; and an efficient LLM inference method that uses only 1 bit per channel for the KV cache.
Research Experience
During his PhD at Rice University, he focused on compression and optimization techniques for LLMs, developing a lossless compression method that reduces model size by 30% while preserving bit-for-bit identical outputs and enabling efficient GPU inference.
Education
2021 - 2025: PhD in Computer Science, Rice University, advised by Prof. Anshumali Shrivastava.
2016 - 2021: B.S. in Computer Science, University of Waterloo.
Background
PhD candidate in Computer Science, with research interests in lossless and lossy model compression, inference optimizations, accurate and efficient fine-tuning, GPU kernel design and optimization, and quantization. Aims to make large language models (LLMs) and foundation models more efficient, accurate, and accessible.
Miscellany
He also goes by Tony and has made open-source contributions; his work has reached #1 on Hacker News, and his models on Hugging Face receive thousands of monthly downloads.