INA Research Group

ES-MoE: Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training (Credit: SOD)

ES-MoE overlaps each expert's computation with its communication, and pipelines CPU optimization at expert granularity so that it overlaps with the layer's backward pass. E0, ..., E3 denote experts in the same layer.

Traditional MoE training (static expert placement)

ES-MoE (dynamic expert placement)

Example of training an MoE model with 4 GPUs and 8 experts. E0 to E7 denote separate experts in the same layer. The dynamic expert placement of ES-MoE eliminates the need for zero padding, achieving high efficiency.
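To make the contrast in the figure concrete, here is a minimal sketch in Python of the same 4-GPU, 8-expert setting. It is not ES-MoE's placement algorithm: the token counts are hypothetical, the dynamic variant is a plain greedy heuristic (heaviest expert first, onto the least-loaded GPU) chosen for illustration, and padding is modeled simply as the slots every GPU must fill to match the busiest one.

```python
import heapq

def static_placement(token_counts, num_gpus):
    """Pin expert i to GPU i // experts_per_gpu and return per-GPU token loads."""
    experts_per_gpu = len(token_counts) // num_gpus
    loads = [0] * num_gpus
    for expert, tokens in enumerate(token_counts):
        loads[expert // experts_per_gpu] += tokens
    return loads

def greedy_placement(token_counts, num_gpus):
    """Illustrative heuristic (an assumption, not the paper's method):
    assign the heaviest experts first, each to the least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    loads = [0] * num_gpus
    for tokens in sorted(token_counts, reverse=True):
        load, gpu = heapq.heappop(heap)
        loads[gpu] = load + tokens
        heapq.heappush(heap, (loads[gpu], gpu))
    return loads

def padded_slots(loads):
    """Simplified padding model: every GPU is padded up to the busiest GPU."""
    return len(loads) * max(loads) - sum(loads)

# Hypothetical tokens routed to experts E0..E7 in one layer.
token_counts = [400, 350, 50, 50, 300, 250, 300, 300]

static_loads = static_placement(token_counts, num_gpus=4)
dynamic_loads = greedy_placement(token_counts, num_gpus=4)

print("static :", static_loads, "padding:", padded_slots(static_loads))
print("dynamic:", dynamic_loads, "padding:", padded_slots(dynamic_loads))
```

Under these assumptions the greedy assignment lowers both the maximum per-GPU load and the padded slots, but it still moves whole experts, so some imbalance remains.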

Summary

Mixture-of-Experts (MoE) is a powerful technique for enhancing the performance of neural networks while decoupling computational complexity from the number of parameters. However, scaling the number of experts still requires adding more GPUs. In addition, imbalanced token loads across experts cause unnecessary computation or straggler problems. We present ES-MoE, a novel method for efficiently scaling MoE training. It offloads expert parameters to host memory and uses pipelined expert processing to overlap GPU-CPU communication with GPU computation. It also dynamically balances token loads across GPUs, improving computational efficiency. ES-MoE accelerates MoE training on a limited number of GPUs without degrading model performance. We validate our approach on GPT-based MoE models, demonstrating 67× better scalability and up to 17.5× better throughput than existing frameworks.
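The benefit of overlapping GPU-CPU communication with GPU computation can be illustrated with a small timing model. The sketch below is not taken from the ES-MoE implementation: the function names and per-expert timings are hypothetical, and it assumes a single copy engine that transfers experts back to back, a single compute stream, and enough GPU buffer space to hold prefetched experts.

```python
def sequential_time(transfer, compute):
    """No overlap: each expert is copied to the GPU, then computed, one at a time."""
    return sum(transfer) + sum(compute)

def pipelined_time(transfer, compute):
    """Two-stage pipeline: the copy of the next expert overlaps with the
    computation of the current one (assumes unlimited prefetch buffers)."""
    copy_done = 0.0     # time at which the latest host-to-GPU copy finishes
    compute_done = 0.0  # time at which the latest expert computation finishes
    for t, c in zip(transfer, compute):
        copy_done += t  # copies are issued back to back
        # An expert can start only after its copy and the previous compute finish.
        compute_done = max(compute_done, copy_done) + c
    return compute_done

# Hypothetical per-expert costs in milliseconds for four experts.
transfer_ms = [2.0, 2.0, 2.0, 2.0]
compute_ms = [3.0, 3.0, 3.0, 3.0]

print("sequential:", sequential_time(transfer_ms, compute_ms))
print("pipelined :", pipelined_time(transfer_ms, compute_ms))
```

In this model, when computation per expert takes at least as long as its transfer, only the first transfer is exposed and the rest is hidden behind computation. When transfers dominate, the pipeline becomes bound by the copy engine instead.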

Publications

ICML

Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training

Yechan Kim*, Hwijoon Lim*, and Dongsu Han

In Proceedings of the 41st International Conference on Machine Learning, Jul 2024

Paper Project Code

Members

Yechan Kim

Alumni

Hwijoon Lim

Alumni

Dongsu Han

Principal Investigator