Abstract

In this paper, we propose FastDoc (Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that, during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three domains (customer support, scientific, and legal) and compare performance on 6 downstream tasks and 9 datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to masked language modeling (MLM) and/or next sentence prediction (NSP) in the customer support, scientific, and legal domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that FastDoc either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the customer support, scientific, and legal domains. Moreover, reduced training helps mitigate the risk of catastrophic forgetting. Thus, unlike the baselines, FastDoc shows a negligible drop in performance in the open domain.
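To illustrate why sentence-level inputs help with long documents, the following is a minimal sketch, not the authors' implementation. It compares the number of encoder input positions when a document is fed as tokens versus as one embedding per sentence, and the resulting ratio of pairwise self-attention interactions in a standard transformer. The function names, the regex-based sentence splitter, and the whitespace tokenizer are all assumptions made for illustration; the abstract does not specify how sentences are segmented or embedded, and the reported compute reductions come from the full framework, not from this effect alone.

```python
import re


def split_sentences(document: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def input_lengths(document: str) -> tuple[int, int]:
    """Return (token-level length, sentence-level length) for one document.

    Tokens are approximated by whitespace-separated words (assumption);
    a real encoder would use a subword tokenizer.
    """
    sentences = split_sentences(document)
    n_tokens = sum(len(s.split()) for s in sentences)
    return n_tokens, len(sentences)


def attention_pair_ratio(n_tokens: int, n_sentences: int) -> float:
    """Ratio of pairwise self-attention interactions (n^2), token-level over sentence-level."""
    if n_sentences == 0:
        raise ValueError("document contains no sentences")
    return (n_tokens ** 2) / (n_sentences ** 2)


doc = "Restart the router. Wait ten seconds. Then plug the cable back in."
n_tokens, n_sentences = input_lengths(doc)
print(n_tokens, n_sentences)                        # 12 3
print(attention_pair_ratio(n_tokens, n_sentences))  # 16.0
```

For this toy document, the sentence-level input has 3 positions instead of 12, so the encoder handles 16 times fewer pairwise interactions per self-attention layer. The gap widens as documents grow longer, which is consistent with the stated motivation of accommodating long documents.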

FastDoc: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Abhilash Nandy · Manav Nitin Kapadnis · Sohan Patnaik · Yash Parag Butala · Pawan Goyal · Niloy Ganguly
