https://ntp.niehs.nih.gov/go/910964

Curation of Developmental Toxicity Reference Data

Establishing confidence in alternatives to animal use for developmental toxicity and other endpoints requires high-quality reference data for evaluation of new approaches. Identifying, extracting, and annotating information from the full text of scientific publications is a critical step in compiling such datasets. However, manually extracting protocol details (e.g., species, route of administration, dosing regimen) and treatment-related findings is labor-intensive and can introduce errors. Furthermore, for these data to be optimally useful and adhere to FAIR data principles (findability, accessibility, interoperability, and reusability), they should be curated using standardized terminologies and controlled vocabularies. NICEATM and collaborators are exploring approaches to standardizing and automating these processes.

Semi-automated Extraction of Literature Data Using Machine Learning Methods

NICEATM, other scientists in the NIEHS Division of Translational Toxicology, Oak Ridge National Laboratory, and the FDA are collaborating to automate the process of identifying high-quality developmental toxicity studies in the published scientific literature. The project applies natural language processing and both supervised and unsupervised machine learning methods to identify specific data elements in the full text of scientific publications.
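
As a simplified illustration of rule-based element extraction (not the project's actual pipeline), the Python sketch below uses regular expressions to pull a few study descriptors, such as species, route of administration, and dose, from a methods passage. The patterns, field names, and example text are hypothetical.

import re

# Minimal sketch of rule-based extraction of study descriptors from
# free text. Patterns and field names are illustrative placeholders.
DESCRIPTOR_PATTERNS = {
    "species": r"\b(rat|rats|mouse|mice|rabbit|rabbits)\b",
    "route": r"\b(oral gavage|gavage|dietary|drinking water|subcutaneous|intraperitoneal)\b",
    "dose_mg_per_kg": r"(\d+(?:\.\d+)?)\s*mg/kg(?:/day|\s*bw/day)?",
}

def extract_descriptors(text: str) -> dict:
    """Return the first match for each descriptor pattern, if any."""
    text = text.lower()
    found = {}
    for field, pattern in DESCRIPTOR_PATTERNS.items():
        match = re.search(pattern, text)
        found[field] = match.group(1) if match else None
    return found

if __name__ == "__main__":
    methods_text = (
        "Pregnant rats were dosed by oral gavage with 0, 10, 30, or "
        "100 mg/kg/day from gestation day 6 through 15."
    )
    print(extract_descriptors(methods_text))
    # {'species': 'rats', 'route': 'oral gavage', 'dose_mg_per_kg': '100'}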

Preliminary models were trained on the uterotrophic database (Kleinstreuer et al. 2016) built for the EPA Endocrine Disruptor Screening Program and leverage natural language processing and multivariate machine learning to identify papers that meet the minimum criteria for guideline-like studies (Herrmannova et al. 2018). Both supervised and unsupervised approaches have been developed to automatically extract text features corresponding to study descriptors and to classify papers by their adherence to minimum criteria derived from regulatory guideline studies; these methods demonstrate high cross-validated performance on the uterotrophic training set. In collaboration with the ICCVAM Developmental and Reproductive Toxicity Expert Group, this work is being extended to automate the identification of high-quality prenatal developmental toxicity studies in the literature.
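
The sketch below shows, in general terms, how cross-validated performance of such a classifier can be estimated. The labeled excerpts, TF-IDF features, and logistic regression model are assumptions for illustration only, not the published modeling approach.

# Illustrative sketch of a cross-validated text classifier for flagging
# papers that meet minimum "guideline-like" criteria. The documents,
# labels, and model choice are placeholders, not the published models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical training data: full-text excerpts and expert labels
# (1 = meets minimum criteria, 0 = does not).
documents = [
    "Immature female rats received test article by oral gavage for 3 days.",
    "A review of estrogenic activity in environmental samples.",
    "Uterine weights were recorded 24 hours after the final dose.",
    "Commentary on regulatory policy for endocrine disruptors.",
]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)

# k-fold cross-validation estimates how well the classifier generalizes
# to papers it has not seen during training.
scores = cross_val_score(pipeline, documents, labels, cv=2, scoring="accuracy")
print(f"Cross-validated accuracy: {scores.mean():.2f}")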

Extraction and Annotation of Legacy Developmental Toxicity Study Data

To support the evaluation of non-animal approaches for developmental toxicity assessment, NICEATM scientists extracted information from over 100 NTP legacy prenatal developmental toxicity animal studies and from a subset of about 50 studies submitted to the European Chemicals Agency (ECHA) that NTP subject matter experts deemed high quality (Foster et al. 2024). Study details extracted included species, strain, administration route, dosing duration, and treatment-related effects. Extracted data were standardized using controlled vocabularies and ontologies to facilitate computational analyses and integration with other structured databases such as EPA’s ToxRefDB. Elements of three controlled vocabularies (the Unified Medical Language System [UMLS], the German Federal Institute for Risk Assessment [BfR] DevToxDB ontology, and the OECD Harmonized Template 74 terminologies) were combined with automation code to programmatically standardize the primary source language of extracted developmental toxicology endpoints. Augmenting manual efforts with automation tools increased the efficiency of producing a FAIR dataset of regulatory guideline studies. This open-source approach can be readily applied to other legacy developmental toxicology datasets, and the code design is customizable for other study types.
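
The sketch below illustrates the general idea of programmatic term harmonization using a small, hypothetical synonym map. The verbatim terms and mapped terms shown are placeholders; the actual workflow draws on UMLS, the BfR DevToxDB ontology, and OECD Harmonized Template 74 terminologies rather than this toy dictionary.

import re

# Minimal sketch of harmonizing verbatim endpoint text to controlled
# vocabulary terms. The mapping table below is hypothetical.
SYNONYM_MAP = {
    "reduced fetal weight": "fetal body weight, decreased",
    "decreased fetal body weight": "fetal body weight, decreased",
    "cleft palate": "cleft palate",
    "post-implantation loss": "postimplantation loss, increased",
    "resorptions increased": "postimplantation loss, increased",
}

def normalize(term: str) -> str:
    """Lowercase, trim, and collapse whitespace so near-duplicates match."""
    return re.sub(r"\s+", " ", term.strip().lower())

def harmonize(verbatim_terms: list[str]) -> list[tuple[str, str | None]]:
    """Map each verbatim endpoint to a controlled term, or None if unmapped."""
    return [(t, SYNONYM_MAP.get(normalize(t))) for t in verbatim_terms]

if __name__ == "__main__":
    extracted = ["Reduced  fetal weight", "Cleft Palate", "hydronephrosis"]
    for verbatim, mapped in harmonize(extracted):
        status = mapped if mapped else "UNMAPPED - review manually"
        print(f"{verbatim!r} -> {status}")

Terms that fail to map are flagged for manual review rather than dropped, which is one way automation can be combined with expert curation to keep the standardized dataset both complete and accurate.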