
2021 FDA Science Forum

SPL-BERT: A Deep Bidirectional Transformer Language Model Using Structured Product Labeling (SPL) Resources of FDA

Authors:
Haider, Saad, FDA/NCTR; Xu, Joshua, FDA/NCTR; Tong, Weida, FDA/NCTR; Wu, Leihong, FDA/NCTR

Center:
National Center for Toxicological Research

Abstract

Background and Purpose

Extracting and mining information from text-based documents using natural language processing has advanced significantly in recent years and is being applied to diverse biomedical documents, including publications, electronic health records, and product adverse event reports. Deep learning-based language models trained on domain-specific texts can perform various language tasks such as named entity recognition (NER), question answering (Q&A), and relation extraction (RE).

The FDA uses the Structured Product Labeling (SPL) standard to disseminate information on regulated products, including ~130,000 drug products. In particular, SPLs contain specific information related to drug safety and efficacy that current generic language models were not well trained on. Hence, a domain-specific language model was developed by training on SPL documents.

Methodology and Results

The initial model was trained on ~41,000 labels of human prescription drugs. The training corpus contains ~146M words (~1.5 GB), including sections such as boxed warnings (1.2M words), adverse reactions (5M words), and indications (2M words). The training procedure updated the existing pre-trained weights with the SPL dataset on an in-house GPU server with eight Nvidia Tesla V100 (32 GB) cards. With a batch size of 64 and a maximum token length of 128, training took around 22 hours.
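The continued pretraining described above follows BERT's masked-language-model objective, in which a fraction of input tokens is corrupted and the model learns to recover them. As an illustration only (not the authors' actual pipeline, whose code is not shown in the abstract), a minimal sketch of BERT-style token masking on whitespace-split words, with hypothetical example text:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style masked-language-model corruption.

    Each token is selected with probability `mask_prob`; of the
    selected tokens, 80% become [MASK], 10% are replaced by a random
    vocabulary word, and 10% are kept unchanged. Returns
    (corrupted_tokens, labels), where labels[i] holds the original
    token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)           # model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)    # kept unchanged but still predicted
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

# Hypothetical sentence and tiny vocabulary for illustration:
tokens = "abaloparatide may cause hypercalciuria in some patients".split()
vocab = ["drug", "dose", "reaction", "label"]
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
```

In actual BERT pretraining this corruption is applied to subword token IDs in large batches; the sketch uses whole words only to keep the 80/10/10 selection rule visible.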

The new BERT model (i.e., SPL-BERT) outperformed existing models on adverse reaction alerts, Q&A, and NER. For instance, SPL-BERT accurately predicted hypercalciuria, an adverse reaction of abaloparatide, which the general BERT-base model failed to recognize. A UMAP clustering analysis of the word embedding features also indicated that SPL-BERT clustered drug-related terminology better than other BERT models.
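The UMAP analysis above operates on word embedding vectors; a common underlying measure of whether domain terms sit close together in embedding space is cosine similarity between their vectors. A minimal sketch with toy 3-dimensional vectors (hypothetical values chosen for illustration, not actual SPL-BERT embeddings, which are 768-dimensional):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: two related clinical terms and one unrelated word.
emb = {
    "hypercalciuria": [0.9, 0.1, 0.2],
    "hypercalcemia":  [0.8, 0.2, 0.1],
    "umbrella":       [0.1, 0.9, 0.7],
}
sim_related = cosine_similarity(emb["hypercalciuria"], emb["hypercalcemia"])
sim_unrelated = cosine_similarity(emb["hypercalciuria"], emb["umbrella"])
```

In a well-adapted domain model, related terms such as the two clinical words above should score higher similarity than unrelated pairs; UMAP then projects these high-dimensional neighborhoods into 2-D for visual inspection of the clusters.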

Conclusion

In summary, a domain-specific language model, SPL-BERT, was developed on drug labeling documents and performed well on labeling studies. This effort will support further enhancements toward its application in FDA regulatory activities such as drug labeling review and data analysis.



Download the Poster (PDF; 0.34 MB)
