Live Session
Session 16: Large Language Models 2
Reproducibility
Reproducibility of LLM-based Recommender Systems: the case study of P5 paradigm
Pasquale Lops (University of Bari Aldo Moro), Antonio Silletti (University of Bari Aldo Moro), Marco Polignano (University of Bari Aldo Moro), Cataldo Musto (University of Bari Aldo Moro) and Giovanni Semeraro (University of Bari Aldo Moro)
Abstract
The recommender systems field may greatly benefit from the availability of pretrained Large Language Models (LLMs), which can serve as the core mechanism to generate recommendations based on detailed user and item data, such as textual descriptions, user reviews, and metadata. On the one hand, this new generation of LLM-based recommender systems paves the way to addressing traditional limitations, such as cold start and data sparsity; on the other hand, it poses fundamental challenges for their accountability. Reproducing experiments in the new context of LLM-based recommender systems is very challenging for several reasons. New approaches are published at an unprecedented pace, which makes it difficult to have a clear picture of the main protocols and good practices in experimental evaluation. Moreover, the lack of proper frameworks for LLM-based recommendation development and evaluation makes the process of benchmarking models complex and uncertain. In this work, we discuss the main issues encountered when trying to reproduce P5 (Pretrain, Personalized Prompt, and Predict Paradigm), one of the first works unifying different recommendation tasks in a shared language modeling and natural language generation framework. Starting from this study, we have developed OurFramework4LLM (anonymized name), a framework for training and evaluating LLMs, specifically for the recommendation task. It has been used to perform several experiments assessing the impact of different LLMs, personalization, and a novel set of more informative prompts on the overall performance of recommendations, in a fully reproducible environment.
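To make the P5 idea of casting recommendation as natural language generation concrete, the following is a minimal sketch (not the authors' code or prompts): a personalized prompt verbalizes a user's interaction history and an encoder-decoder model generates the next item as text. The checkpoint name ("t5-small"), the prompt template, and the user/item identifiers are illustrative assumptions only.

# Minimal sketch of a P5-style, text-to-text recommendation query.
# Assumes the Hugging Face "transformers" library; "t5-small" stands in
# for whatever pretrained backbone a P5-like system would actually use.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical personalized prompt: user and item IDs are verbalized so that
# the same model can serve sequential recommendation as plain text generation.
prompt = (
    "User_23 has purchased item_113, item_97, item_250. "
    "Predict the next item this user will interact with:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))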