Industry Poster
Toward 100TB Recommendation Models with Embedding Offloading
Intaik Park (Meta), Paul Zhang (Meta), Ehsan Ardestani (Meta), Damian Reeves (Meta), Sarunya Pumma (Meta), Henry Tsang (Meta), Levy Zhao (Meta), Jian He (Meta), Joshua Deng (Meta), Dennis Van der Staay (Meta) and Yu Guo (Meta)
Abstract
Training recommendation models becomes memory-bound as embedding tables grow large, and fast GPU memory is scarce. In this paper, we explore embedding caches and prefetch pipelines that effectively leverage large but slow host memory for embedding tables. We introduce Locality-Aware Sharding and iterative planning, which automatically size caches optimally and produce effective sharding plans. Embedding Offloading, a system that combines all of these components and techniques, is implemented on top of Meta’s open-source libraries FBGEMM GPU and TorchRec, and it is used to improve the scalability and efficiency of industry-scale production models. Embedding Offloading achieved a 37x increase in model scale, to a 100TB model size, with only a 26% training speed regression.
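For orientation, the sketch below shows the basic building block this approach rests on: a UVM-cached embedding table in FBGEMM GPU, where rows reside in host memory and a GPU-side software cache (sized by a cache load factor) keeps hot rows on-device. This is a minimal illustration, not the paper's implementation; the module paths, table size, and 5% load factor are assumptions, and import locations vary across FBGEMM GPU releases.

```python
# Minimal sketch (assumed names/paths; not the paper's code): one embedding
# table whose rows live in host (UVM) memory, with a GPU-resident cache.
import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, where rows live, where compute runs):
        # MANAGED_CACHING places rows in host memory behind a GPU cache.
        (100_000_000, 128, EmbeddingLocation.MANAGED_CACHING, ComputeDevice.CUDA),
    ],
    cache_load_factor=0.05,  # illustrative: cache ~5% of rows on the GPU
)

# Lookup in the usual flattened indices/offsets form: 1024 bags, one index each.
indices = torch.randint(0, 100_000_000, (1024,), device="cuda")
offsets = torch.arange(0, 1025, device="cuda")
pooled = emb(indices, offsets)  # [1024, 128] pooled embeddings
```

In TorchRec, the equivalent knob is exposed per table through the sharding planner's constraints; that planner layer is where automated cache sizing of the kind the abstract describes would plug in.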