Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark
Shijie Liu (NVIDIA), Nan Zheng (NVIDIA), Hui Kang (NVIDIA), Xavier Simmons (NVIDIA), Junjie Zhang (NVIDIA), Matthias Langer (NVIDIA), Wenjing Zhu (NVIDIA), Minseok Lee (NVIDIA) and Zehuan Wang (NVIDIA)
Abstract
Training large-scale deep learning recommendation models (DLRMs) with embedding tables stretching across multiple GPUs in a cluster presents a unique challenge, demanding the efficient scaling of embedding operations that require substantial memory and network bandwidth within a hierarchical network of GPUs. To tackle this bottleneck, we introduce EMBark---a comprehensive solution aimed at enhancing embedding performance and overall DLRM training throughput at scale. EMBark empowers users to create and customize sharding strategies, and features a highly automated sharding planner to accelerate diverse model architectures on different cluster configurations. EMBark groups embedding tables according to their preferred communication compression methods, effectively reducing communication overhead. It combines efficient data-parallel category distribution with topology-aware hierarchical communication and pipelining support to maximize DLRM training throughput. Across four representative DLRM variants (DLRM-DCNv2, T180, T200, and T510), EMBark achieves an average end-to-end training throughput speedup of 1.5x and up to 1.77x over traditional table-row-wise sharding approaches.
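To make the sharding-planner idea concrete, the sketch below illustrates, in plain Python, one way a planner could group embedding tables by their preferred communication compression method and balance the resulting shards across a hierarchical GPU topology. This is not EMBark's actual API: every name here (`EmbeddingTable`, `plan_sharding`, the compression labels, and the greedy balancing heuristic) is a hypothetical simplification of the behavior described in the abstract.

```python
# Illustrative sketch only -- all classes, functions, and compression labels
# below are hypothetical and do not correspond to EMBark's real implementation.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class EmbeddingTable:
    name: str
    num_rows: int
    embedding_dim: int
    compression: str  # e.g. "dense-allreduce" for small hot tables, "sparse-alltoall" for large cold ones


def plan_sharding(tables, num_nodes, gpus_per_node):
    """Group tables by preferred communication compression method and
    spread each group's memory load across the GPUs in the cluster."""
    groups = defaultdict(list)
    for t in tables:
        groups[t.compression].append(t)

    world_size = num_nodes * gpus_per_node
    plan = {}
    for method, group in groups.items():
        if method == "dense-allreduce":
            # Small, frequently accessed tables: replicate data-parallel so
            # gradients can be reduced with an allreduce within each node.
            for t in group:
                plan[t.name] = {"placement": "data-parallel",
                                "devices": list(range(world_size))}
        else:
            # Large tables: shard row-wise, assigning each table to the
            # currently lightest-loaded GPU (simple greedy balancing).
            load = [0] * world_size
            for t in sorted(group, key=lambda t: t.num_rows * t.embedding_dim,
                            reverse=True):
                gpu = min(range(world_size), key=lambda g: load[g])
                load[gpu] += t.num_rows * t.embedding_dim
                plan[t.name] = {"placement": "row-wise-shard", "devices": [gpu]}
    return plan


if __name__ == "__main__":
    tables = [
        EmbeddingTable("user_id", 10_000_000, 128, compression="sparse-alltoall"),
        EmbeddingTable("item_category", 2_000, 128, compression="dense-allreduce"),
    ]
    print(plan_sharding(tables, num_nodes=2, gpus_per_node=8))
```

The key design point the sketch tries to convey is that the compression method chosen for a table constrains its placement: replication pairs naturally with intra-node allreduce, while row-wise sharding pairs with inter-node alltoall, so grouping tables by method before placement keeps each group's communication pattern uniform.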