Live Session
Chamber of Commerce
Poster
16 Oct, 8:00 CEST
Wednesday Posters
2024-10-16, 08:00 am to 05:30 pm (Europe/Rome). Wednesday Posters is taking place on the RecSys Hub: https://recsyshub.org
Industry Poster

Improving Data Efficiency for Recommenders and LLMs

View on ACM Digital Library

Noveen Sachdeva (Google DeepMind), Benjamin Coleman (Google DeepMind), Wang-Cheng Kang (Google DeepMind), Jianmo Ni (Google DeepMind), James Caverlee (Texas A&M University), Lichan Hong (Google DeepMind), Ed Chi (Google DeepMind) and Derek Cheng (Google DeepMind)

Abstract

In recent years, massive transformer-based architectures have driven breakthrough performance in practical applications like autoregressive text generation (LLMs) and click prediction (recommenders). A common recipe for success is to train large models on massive web-scale datasets: for example, modern recommenders are trained on billions of user-item click events, and LLMs are trained on trillions of tokens extracted from the public internet. We are close to hitting the computational and economic limits of scaling up the size of these models, and we expect the next frontier of gains to come from improving (i) the data quality of the training dataset, and (ii) the data efficiency of the extremely expensive training procedure. Inspired by this shift, we present a set of “data-centric” techniques for recommendation and language models that summarize a dataset into a terse data summary, which both (i) is high-quality, i.e., trains better-quality models, and (ii) improves the data efficiency of the overall training procedure. We propose techniques from two disparate data frameworks: (i) data selection (a.k.a. coreset construction) methods that sample portions of the dataset using grounded heuristics, and (ii) data distillation techniques that generate synthetic examples optimized to retain the signals needed for training high-quality models. Overall, this work sheds light on the challenges and opportunities offered by data optimization in web-scale systems, a particularly relevant focus as the recommendation community grapples with the grand challenge of leveraging LLMs.
