Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

Parthiv Chatterjee · Shivam R Sonawane · Amey Hengle · Aditya Tanna · Sourish Dasgupta · Tanmoy Chakraborty


Abstract

Document summarization facilitates efficient identification and assimilation of user-relevant content, a process inherently influenced by individual subjectivity. Discerning $\textit{subjective}$ salient information within a document, particularly when it has multiple facets, poses significant challenges. This complexity underscores the necessity for $\textit{personalized summarization}$. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., the $\textit{click-skip}$ trajectory) and expected (gold-reference) summaries is scarce. The MS/CAS PENS dataset is a rare resource in this direction. However, its training data contains only preference histories $\textit{without any target summaries}$, thereby blocking end-to-end supervised learning. Moreover, the diversity of topic transitions along the trajectories is relatively low, leaving scope for better generalization. To address this, we propose PerAugy, a novel $\textit{cross-trajectory shuffling}$ and $\textit{summary-content perturbation}$-based data augmentation technique that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that, when equipped with the correspondingly improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the role of the diversity that PerAugy induces in the augmented dataset, we introduce three dataset diversity metrics, $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$, to quantify it. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ correlate strongly with user-encoder performance across all accuracy metrics when the encoders are trained on the PerAugy-generated dataset, indicating that the increase in dataset diversity plays a major role in the performance gain.
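
The abstract describes PerAugy only at a high level; the precise augmentation procedure is defined in the paper. For intuition, below is a minimal Python sketch of what cross-trajectory shuffling and summary-content perturbation could look like. The function names, the random pairing scheme, and parameters such as `swap_prob` and `drop_prob` are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List, Tuple

Interaction = Tuple[str, str]  # (document id, "click" or "skip")


def cross_trajectory_shuffle(
    trajectories: List[List[Interaction]],
    swap_prob: float = 0.3,
    seed: int = 0,
) -> List[List[Interaction]]:
    """Swap aligned steps between randomly paired user trajectories.

    A simplified stand-in for cross-trajectory shuffling: mixing steps
    across users introduces new topic transitions into each click-skip
    history, which is the kind of diversity PerAugy aims to increase.
    """
    rng = random.Random(seed)
    augmented = [list(t) for t in trajectories]
    order = list(range(len(augmented)))
    rng.shuffle(order)
    # Pair trajectories two at a time and swap aligned steps with probability swap_prob.
    for a, b in zip(order[::2], order[1::2]):
        for i in range(min(len(augmented[a]), len(augmented[b]))):
            if rng.random() < swap_prob:
                augmented[a][i], augmented[b][i] = augmented[b][i], augmented[a][i]
    return augmented


def perturb_summary(summary: str, drop_prob: float = 0.15, seed: int = 0) -> str:
    """Drop sentences at random to create a perturbed summary variant
    (a crude proxy for summary-content perturbation)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    kept = [s for s in sentences if rng.random() >= drop_prob] or sentences[:1]
    return ". ".join(kept) + "."


if __name__ == "__main__":
    histories = [
        [("n1", "click"), ("n2", "skip"), ("n3", "click")],
        [("n4", "click"), ("n5", "click"), ("n6", "skip")],
    ]
    print(cross_trajectory_shuffle(histories, swap_prob=0.5, seed=42))
    print(perturb_summary("Team wins final. Coach praises defense. Fans celebrate.", 0.5, 7))
```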