Pengfei Zheng

Google Scholar ID: 3QxdztoAAAAJ
Huawei Technologies
Machine Learning Systems, System-Algorithm Co-Design, Distributed Systems, Data+AI
Citations & Impact (all-time)
  • Citations: 241
  • H-index: 5
  • i10-index: 4
  • Publications: 12
  • Co-authors: 12
Academic Achievements
  • arXiv preprint: 'Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling'
  • 'CoffeeBoost: Gradient Boosting Native Conformal Inference for Bayesian Optimization' accepted to AAAI 2025
  • 'Centrum: Model-based Database Auto-tuning with Minimal Distributional Assumptions' accepted to SIGMOD 2025
  • 'Mirage: MOE + Decision Transformer for non-interruption, non-overlap resource provision in ML training' published at SC 2023
  • 'Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning' published at NSDI 2023
  • Mentored students including Lynn Liu (HyperAPX, now a Ph.D. student at UC Berkeley), Rui Pan (Shockwave, now a Ph.D. student at Princeton), and Calvin Ma (Hound)
Background
  • Full-time Staff Researcher at Huawei Technologies
  • Research areas include autonomous system agents, distributed systems, and disaggregated datacenter architecture
  • Designs statistical and neural learning methods to model dynamics and uncertainty in large-scale distributed computer systems
  • Develops algorithmic decision-making mechanisms (e.g., convex/nonlinear optimization, Bayesian optimization, contextual bandits, reinforcement learning) to optimize system performance, efficiency, and scalability
  • Recent work includes black-box optimization for autonomous system tuning, real-time recommendation algorithms for preconditioner/solver selection in linear systems, dynamic market theory, and stochastic policy learning for scheduling and resource allocation
  • Works on high-performance ML training and inference, building multi-dimensional hybrid-parallelism solvers (data, tensor, pipeline, sequence, and expert parallelism, combined with offload, rematerialization, and pipeline interleaving) to optimize token throughput, MFU, and scale-out linearity for hyper-scale LLM training on massive GPU/NPU clusters
  • Researches dynamic MoE layers and real-time model/input pruning techniques to accelerate LLM inference