ICCVAM Biennial Report 2020-2021

Biennial Progress Report 2020-2021 Interagency Coordinating Committee on the Validation of Alternative Methods

Semi-automated Extraction of Literature Data Using Machine Learning Methods

NICEATM, other scientists within the NIEHS Division of the NTP, the DOE's Oak Ridge National Laboratory, and FDA are collaborating to automate the process of identifying high-quality developmental toxicity studies in the published scientific literature. The approach applies both unsupervised and supervised natural language processing and machine learning methods to identify specific data elements in the full text of scientific publications.

Preliminary models were trained using a uterotrophic database (Kleinstreuer et al. 2016) built for the EPA Endocrine Disruptor Screening Program. The models combined natural language processing with multivariate machine learning to identify papers that meet the minimum criteria for guideline-like studies (Herrmannova et al. 2018). Both supervised and unsupervised approaches were developed to automatically extract text features that correspond to study descriptors and to classify papers by their adherence to minimum criteria derived from regulatory guideline studies. These methods demonstrated high cross-validated performance on the uterotrophic training set.
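
As a rough illustration of the supervised side of this kind of workflow, the sketch below classifies short text snippets as guideline-like or not and scores the classifier with leave-one-out cross-validation. It is a minimal sketch under stated assumptions, not the project's method: the report does not say which models or features were used, so a bag-of-words multinomial naive Bayes classifier stands in for them. The snippets, the label names, and the function names (tokenize, train_naive_bayes, predict, leave_one_out_accuracy) are all invented for illustration.

```python
import math
import re
from collections import Counter

# Invented toy snippets and labels, NOT drawn from the uterotrophic database.
DOCS = [
    "ovariectomized rats were dosed by oral gavage for three days and uterine weight was measured",
    "immature female rats received subcutaneous injection for three days and uterine weight was recorded",
    "ovariectomized mice received oral gavage daily and uterine weight was measured at necropsy",
    "cultured cells were exposed in vitro and gene expression was analyzed",
    "a review of estrogen receptor binding assays and computational models",
    "male fish were exposed in water and plasma vitellogenin was analyzed",
]
LABELS = ["guideline_like"] * 3 + ["other"] * 3


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Fit a multinomial naive Bayes model on bag-of-words counts."""
    word_counts = {label: Counter() for label in set(labels)}
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one (Laplace) smoothing
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # skip words never seen in training
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(docs, labels):
    """Hold out each document once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(docs)):
        model = train_naive_bayes(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(model, docs[i]) == labels[i]
    return correct / len(docs)


if __name__ == "__main__":
    model = train_naive_bayes(DOCS, LABELS)
    print(predict(model, "ovariectomized rats uterine weight measured"))
    print(leave_one_out_accuracy(DOCS, LABELS))
```

Holding out each document before scoring it is what makes the estimate cross-validated: the model is never evaluated on text it was trained on. A real application would work from full-text publications and far more labeled examples than this toy set.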

This work is being extended and applied to automate the identification of high-quality prenatal developmental toxicity studies in the literature, in collaboration with the ICCVAM Developmental and Reproductive Toxicity Expert Group. A publication describing this work is being drafted for submission in 2022.