ICCVAM Biennial Report 2018-2019

Biennial Progress Report 2018-2019 Interagency Coordinating Committee on the Validation of Alternative Methods

Semi-automated Extraction of Literature Data Using Machine Learning Methods

Identifying and extracting information from the full text of scientific publications is a critical step in developing reference databases for establishing confidence in alternative approaches. However, manually extracting protocol details such as species, route of administration, and dosing regimen is labor-intensive and error-prone. NIEHS and the Department of Energy’s Oak Ridge National Laboratory are applying natural language processing and machine learning methods, both unsupervised and supervised, to identify specific data elements in the full text of scientific publications.

For example, an unsupervised approach was developed to identify text segments (sentences) relevant to a set of criteria describing specific study parameters, such as species, route of administration, and dosing regimen. A binary classifier was then trained to identify publications that met the criteria. The classifier performed better when trained on these candidate sentences than when trained on sentences picked at random from the text, supporting the hypothesis that this method could accurately identify study descriptors.

This work is being expanded to include machine learning-based multivariate models combined with natural language processing to automatically extract text features that correspond to study descriptors and to classify papers based on their adherence to minimum criteria derived from regulatory guideline studies. A publication is being drafted for submission in 2020.
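The two-step idea described above (first select candidate sentences without labels, then train a binary classifier on them) can be illustrated with a small sketch. This is not the NIEHS/Oak Ridge implementation, whose details the report does not give. The criteria vocabulary `CRITERIA_TERMS`, the keyword-overlap scoring, and the tiny naive Bayes classifier are all assumptions chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical criteria vocabulary; the report does not list the actual terms.
CRITERIA_TERMS = {
    "species": {"rat", "rats", "mouse", "mice", "rabbit", "rabbits"},
    "route": {"oral", "gavage", "dermal", "inhalation", "injection"},
    "dosing": {"dose", "doses", "dosed", "mg/kg", "daily"},
}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase tokens made of letters, digits, and '/' (keeps 'mg/kg')."""
    return re.findall(r"[a-z0-9/]+", sentence.lower())

def score_sentence(sentence):
    """Number of criteria categories that the sentence mentions."""
    tokens = set(tokenize(sentence))
    return sum(1 for terms in CRITERIA_TERMS.values() if tokens & terms)

def candidate_sentences(text, top_k=2):
    """Label-free step: keep up to top_k sentences with a nonzero score."""
    scored = [(score_sentence(s), i, s)
              for i, s in enumerate(split_sentences(text))]
    ranked = sorted((t for t in scored if t[0] > 0),
                    key=lambda t: (-t[0], t[1]))
    return [s for _, _, s in ranked[:top_k]]

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (assumed classifier)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.priors[c] / sum(self.priors.values()))
            return lp + sum(math.log((self.counts[c][t] + 1) / total)
                            for t in tokenize(doc))
        return max(self.classes, key=log_prob)

if __name__ == "__main__":
    paper = ("Toxicity was evaluated in a guideline-like study. "
             "Male rats received oral doses of 50 mg/kg daily for 28 days. "
             "Results are discussed below.")
    print(candidate_sentences(paper))
```

In this sketch a publication would be represented by its candidate sentences before being passed to `NaiveBayes.fit` with a label such as 1 (meets criteria) or 0 (does not). A real pipeline would likely use a learned relevance model and a stronger classifier; the toy versions here only show how the two stages fit together.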