https://ntp.niehs.nih.gov/go/927943

Semi-automated Extraction of Literature Data Using Machine Learning Methods

NICEATM, other scientists within the NIEHS Division of the NTP, the DOE's Oak Ridge National Laboratory, and FDA are collaborating to automate the process of identifying high-quality developmental toxicity studies in the published scientific literature. The approach applies natural language processing together with both supervised and unsupervised machine learning methods to identify specific data elements in the full text of scientific publications.

Preliminary models were trained using a uterotrophic database (Kleinstreuer et al. 2016) built for the EPA Endocrine Disruptor Screening Program. These models combined natural language processing with multivariate machine learning to identify papers that meet the minimum criteria to be considered guideline-like studies (Herrmannova et al. 2018). Supervised and unsupervised approaches were developed to automatically extract text features that correspond to study descriptors and to classify papers by their adherence to minimum criteria derived from regulatory guideline studies. These methods demonstrated high cross-validated performance on the uterotrophic training set.
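
As a rough illustration of the supervised side of this idea, the sketch below trains a small bag-of-words naive Bayes classifier and scores it with leave-one-out cross-validation. It is not the method used in the work described here: the classifier choice, the function names, the labels "guideline_like" and "other", and the six toy sentences are all assumptions made up for this example, and a real application would use full-text papers and the curated uterotrophic database.

```python
import math
import re
from collections import Counter

# Hypothetical toy snippets standing in for paper text (invented for illustration).
TOY_DOCS = [
    "Ovariectomized rats received three daily doses by oral gavage and uterine weight was measured",
    "Immature female rats were dosed for three days by subcutaneous injection and uterine weight was recorded",
    "Ovariectomized mice were treated at two dose levels for three days and uterine wet weight was measured",
    "This review summarizes endocrine disruptor policy and regulatory history",
    "Cell culture assay of receptor binding in vitro with no animal dosing",
    "A commentary on screening program priorities and stakeholder perspectives",
]
# Hypothetical labels: does the text look like a guideline-like study or not?
TOY_LABELS = ["guideline_like"] * 3 + ["other"] * 3


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class for multinomial naive Bayes."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, doc):
    """Return the class with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    # Sorted order makes tie-breaking deterministic (first label alphabetically wins).
    for label in sorted(model["class_counts"]):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(model["class_counts"][label] / total_docs)
        for token in tokenize(doc):
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(docs, labels):
    """Hold out each document once, train on the rest, and report accuracy."""
    correct = 0
    for i in range(len(docs)):
        model = train_naive_bayes(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(model, docs[i]) == labels[i]
    return correct / len(docs)


if __name__ == "__main__":
    model = train_naive_bayes(TOY_DOCS, TOY_LABELS)
    print(predict(model, "Rats dosed by gavage, uterine weight measured"))
    print(leave_one_out_accuracy(TOY_DOCS, TOY_LABELS))
```

Cross-validation matters here because the score comes only from documents the model did not see during training, which is what "cross-validated performance" refers to. With six toy sentences the resulting number carries no meaning beyond showing the mechanics.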

This work is being extended and applied to automate the identification of high-quality prenatal developmental toxicity studies in the literature, in collaboration with the ICCVAM Developmental and Reproductive Toxicity Expert Group. A publication describing this work is being drafted for submission in 2022.