Curation of Developmental Toxicity Reference Data
Establishing confidence in alternatives to animal use for developmental toxicity and other endpoints requires high-quality reference data against which new approaches can be evaluated. Identifying, extracting, and annotating information from the full text of scientific publications is a critical step in compiling such data sets. However, manually extracting protocol details (e.g., species, route of administration, dosing regimen) and treatment-related findings is labor-intensive and error-prone. Furthermore, to be optimally useful and adhere to the FAIR data principles (findability, accessibility, interoperability, and reusability), these data should be curated using standardized terminologies and controlled vocabularies. NICEATM and collaborators are exploring approaches to standardizing and automating these processes.
Semi-automated Extraction of Literature Data Using Machine Learning Methods
NICEATM, other scientists within the NIEHS Division of the NTP, Oak Ridge National Laboratory, and FDA are collaborating to automate the identification of high-quality developmental toxicity studies in the published scientific literature. The approach applies both unsupervised and supervised natural language processing and machine learning methods to identify specific data elements in the full text of scientific publications.
Preliminary models were trained on the uterotrophic database (Kleinstreuer et al. 2016) built for the EPA Endocrine Disruptor Screening Program. They use natural language processing and multivariate machine learning to identify papers that meet minimum criteria to be considered guideline-like studies (Herrmannova et al. 2018). Both supervised and unsupervised approaches were developed to automatically extract text features that correspond to study descriptors and to classify papers by their adherence to minimum criteria derived from regulatory guideline studies. These methods demonstrate high cross-validated performance on the uterotrophic training set.
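The supervised part of this idea, classifying a paper as guideline-like or not from its text, can be illustrated with a minimal sketch. This is not the project's model: it swaps in a small bag-of-words multinomial naive Bayes classifier written with the Python standard library only, and the training snippets, the labels `guideline_like` and `other`, and the class name `NaiveBayesTextClassifier` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets; a real training set would use full-text papers.
train_texts = [
    "Immature female rats were dosed by oral gavage for three consecutive "
    "days and uterine weight was recorded",
    "Ovariectomized rats received subcutaneous injection for three days; "
    "uterine wet weight was measured at necropsy",
    "Cells were incubated with the test chemical and receptor binding was "
    "measured in vitro",
    "A review of endocrine disruptor screening policy and regulatory history",
]
train_labels = ["guideline_like", "guideline_like", "other", "other"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
new_paper = ("Ovariectomized mice were dosed by gavage for three days "
             "and uterine weight was measured")
print(model.predict(new_paper))
```

In practice a model of this kind would be trained on many labeled papers and assessed with cross-validation, as described above for the uterotrophic training set; the four-snippet example only shows the mechanics.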
This work is being extended and applied to automate the identification of high-quality prenatal developmental toxicity studies in the literature, in collaboration with the ICCVAM Developmental and Reproductive Toxicity Expert Group. A publication describing this work is being drafted for submission in 2022.
Extraction and Annotation of Legacy Developmental Toxicity Study Data
To support the evaluation of non-animal approaches for developmental toxicity assessment, NICEATM scientists extracted information from over 100 NTP legacy prenatal developmental toxicity animal studies and a subset of about 50 studies submitted to ECHA that were deemed high-quality by NTP subject matter experts. Study details extracted included species, strain, administration route, dosing duration, and treatment-related effects.
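As an illustration only, the extracted study details listed above could be held in a simple structured record. The field names, the class name `StudyRecord`, and the example values below are assumptions made for this sketch, not the schema NICEATM used.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StudyRecord:
    """One extracted study; all field names are hypothetical."""
    study_id: str
    species: str
    strain: str
    administration_route: str
    dosing_duration: str  # kept verbatim as reported in the source
    treatment_related_effects: list = field(default_factory=list)


# Made-up example values, not taken from any real study.
record = StudyRecord(
    study_id="EXAMPLE-001",
    species="rat",
    strain="Sprague-Dawley",
    administration_route="oral gavage",
    dosing_duration="gestation days 6-15",
    treatment_related_effects=["reduced fetal body weight"],
)

# Serializing to JSON keeps the record machine-readable for later curation.
print(json.dumps(asdict(record), indent=2))
```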
Efforts are underway to standardize the extracted data by applying controlled vocabularies and ontologies, which will facilitate computational analyses and integration with other structured databases such as EPA’s ToxRefDB. Elements of three controlled vocabularies (the Unified Medical Language System, the German Federal Institute for Risk Assessment [BfR] DevToxDB ontology, and the OECD Harmonized Template 74 terminologies) were combined with automation code to programmatically standardize the primary-source language of extracted developmental toxicology endpoints. This work aims to reduce manual labor, facilitate further analyses (e.g., systematic review, model-building, new approach methodology validation), and uphold FAIR principles. A poster describing this work (Foster et al.) was presented at the 11th World Congress on Alternatives and Animal Use in the Life Sciences, and a publication is being drafted for submission in 2022.
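One way such programmatic standardization might look is sketched below, assuming a small lookup table of preferred terms and synonyms. The terms, the placeholder identifiers (`EX:0001` and so on), the function name `standardize`, and the 0.85 similarity cutoff are all hypothetical; they are not entries from the three vocabularies named above, and this is not the project's automation code.

```python
import difflib
import re

# Hypothetical controlled vocabulary: preferred term -> placeholder identifier.
CONTROLLED_TERMS = {
    "cleft palate": "EX:0001",
    "reduced fetal body weight": "EX:0002",
    "supernumerary rib": "EX:0003",
}

# Hypothetical synonyms seen in source documents -> preferred term.
SYNONYMS = {
    "palate cleft": "cleft palate",
    "decreased fetal weight": "reduced fetal body weight",
    "extra rib": "supernumerary rib",
}


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def standardize(verbatim, cutoff=0.85):
    """Map a verbatim endpoint to (preferred term, identifier, match type)."""
    key = normalize(verbatim)
    if key in CONTROLLED_TERMS:
        return key, CONTROLLED_TERMS[key], "exact"
    if key in SYNONYMS:
        term = SYNONYMS[key]
        return term, CONTROLLED_TERMS[term], "synonym"
    # Fall back to approximate string matching against all known phrasings.
    candidates = list(CONTROLLED_TERMS) + list(SYNONYMS)
    close = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
    if close:
        term = SYNONYMS.get(close[0], close[0])
        return term, CONTROLLED_TERMS[term], "fuzzy"
    # Nothing close enough: leave the term for a human curator.
    return None, None, "needs_review"


for phrase in ["Cleft Palate", "Extra rib", "cleft palates", "enlarged liver"]:
    print(phrase, "->", standardize(phrase))
```

Returning the match type alongside the mapped term lets curators spot-check fuzzy matches and manually resolve anything flagged `needs_review`, so automation reduces manual labor without hiding uncertain mappings.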