Live Session
Wednesday Posters
Industry Poster
Improving Data Efficiency for Recommenders and LLMs
Noveen Sachdeva (Google DeepMind), Benjamin Coleman (Google DeepMind), Wang-Cheng Kang (Google DeepMind), Jianmo Ni (Google DeepMind), James Caverlee (Texas A&M University), Lichan Hong (Google DeepMind), Ed Chi (Google DeepMind) and Derek Cheng (Google DeepMind)
Abstract
In recent years, massive transformer-based architectures have driven breakthrough performance in practical applications like autoregressive text-generation (LLMs) and click-prediction (recommenders). A common recipe for success is to train large models on massive web-scale datasets, e.g., modern recommenders are trained on billions of user-item click events, and LLMs are trained on trillions of tokens extracted from the public internet. We are close to hitting the computational and economical limits of scaling up the size of these models, and we expect the next frontier of gains to come from improving the: (i) data quality of the training dataset, and (ii) data efficiency of the extremely expensive training procedure. Inspired by this shift, we present a set of “data-centric” techniques for recommendation and language models that summarizes a dataset into a terse data summary, which is both (i) high-quality, i.e., trains better quality models, and (ii) improves the data-efficiency of the overall training procedure. We propose techniques from two disparate data frameworks: (i) data selection (a.k.a., coreset construction) methods that sample portions of the dataset using grounded heuristics, and(ii) data distillation techniques that generate synthetic examples which are optimized to retain the signals needed for training high-quality models. Overall, this work sheds light on the challenges and opportunities offered by data optimization in web-scale systems, a particularly relevant focus as the recommendation community grapples with the grand challenge of leveraging LLMs.