High-Throughput Transcriptomics and the S1500+ Gene Set Strategy

Tox21 Phase III includes a focus on mid- to high-throughput gene expression screens using a variety of normal human cells and cell lines. These screens capture information from the whole transcriptome (i.e., the entirety of all expressed RNA molecules in a cell or biological sample) to gain insight into how biological systems respond to substance exposures. One approach in this project is referred to as the "S1500+ Gene Set High-Throughput Transcriptomics Project."

The transcriptome of a biological system is dynamic, changing in composition in response to various factors. As such, the transcriptome can be considered a read-out of the physiological or pathophysiological status of a biological system.

Under the S1500+ Gene Set Strategy, the Tox21 Working Group developed and used a hybrid approach comprised of five sequential modules to identify the optimal set of genes that best represents biological diversity, addresses gene-gene co-expression relationships, and represents known pathways adequately. This hybrid approach accurately balances data-driven and knowledge-based evidence while allowing for performance assessment of the selected genes with respect to the gene set's ability to extrapolate whole transcriptome changes, both at the individual gene level and pathway level.

Download the Human S1500+ Gene Set

For the selection of a set of human sentinel genes (also referred to as the human S1500+ gene set), the Tox21 Working Group downloaded 3,339 gene expression series from the Gene Expression Omnibus (GEO) on February 3, 2014. These gene expression series contained 91,491 samples that utilized the HG-U133plus2 Affymetrix microarray platform. The data were manually curated by pairing control and corresponding non-control samples using experimental design and other relevant information provided in the GEO database and/or associated manuscripts. The manually curated ".CEL" files were processed using RMA normalization. Probe-level log2-fold changes were summarized using a median polish approach to compute gene-level log2-fold change values.

Details of the S1500+ gene set selection approach can be found in the publication:
Mav D, Shah RR, Howard BE, Auerbach SS, Bushel PR, Collins JB, et al. A hybrid gene selection approach to create the S1500+ targeted gene sets for use in high-throughput transcriptomics. PLoS One. 2018 Feb 20;13(2):e0191105. doi: 10.1371/journal.pone.0191105.

Consensus Strategy for Selection of the Human S1500+ Gene Set - Hybrid Approach Modules

The first module computes a diversity importance score (DIS) for each gene as follows:

  1. Perform principal component analysis (PCA) and generate clusters of experiments using k-means clustering of primary principal components (PCs) (smallest set of PCs that collectively retain specified variability).
  2. Within each cluster, perform PCA and compute cluster-level DIS for each gene defined as square sum of all primary PC loading coefficients.
  3. Aggregate cluster-level DIS using Tukey biweight mean to obtain one DIS per gene.

The second module computes the co-expression importance score (CIS) for each gene using n-fold cross-validation-like approach, thus all experiments are partitioned into mutually exhaustive and exclusive folds of the same size. The unsupervised hierarchical clustering dendrogram (generated using a Pearson correlation matrix after leaving out fold-specific experiments) is cut for each fold to obtain gene-modules. Fold-level CIS for each gene is computed as mean squared Pearson correlation using fold-specific experiments only. Finally, the fold-level gene specific CIS values are aggregated using Tukey biweight mean to compute CIS.

The third module computes an overall importance score (OIS) for each individual gene by taking mean squared ranks of CIS and DIS.

The fourth module utilizes pathway annotations from Molecular Signature Database (MSigDB v 4.0) to ensure coverage of the curated gene sets (C2) in canonical pathways (CP). The gene selection is updated iteratively by identifying best inclusion and best exclusion gene candidates such that all pathways (C2:CP) are represented by at least three genes.

These four modules provide a computational selection of the top 1500 "Sentinel" genes, referred to as the data-driven derived S1500 gene set.

Additional genes were nominated by members of the scientific community in response to a Federal Register notice requesting nominations for genes of importance in toxicological and disease processes. Over 2,000 unique genes were nominated and from those an additional 1,239 genes were selected to be added to the S1500 gene set to yield the S1500+ gene set of 2,739 unique genes.

An extrapolation module was developed to compute a transformation matrix to extrapolate the expression fold-changes of unselected genes from expression fold-changes of the actual measurements of the selected genes. This matrix is derived using principal component regression where primary principal components constructed from a selected gene data matrix are regressed. Like regularized regression, this approach avoids over fitting but is computationally less intensive while working with high-dimensional data.

In the performance evaluation, the Tox21 Working Group computed several gene-level, pathway-level metrics to assess the extrapolation capability of a given selection.

This method uses publicly available human gene expression data and computes a score for every gene where the score represents a gene's importance in representing the transcriptional diversity, correlation of gene expression with other genes, and known pathway annotation.

The Tox21 Working Group performed 20-fold cross validation and tested the performance using a variety of parameters including Pearson correlation (gene expression fold change values), concordance rate (pathway activity calls), and significance overlap (top differentially expressed genes), where the actual microarray data from the test set were compared to the extrapolated data generated by NTP's Tox21 method. Results indicate that our method can select 1500 genes with full pathway coverage and that the S1500 and the S1500+ gene sets can predict the pathway level changes with high accuracy (average Pearson correlation of 0.81 and 0.87, respectively, concordance rate of 0.87 and 0.90, respectively, and significance overlap of 0.52 and 0.60, respectively, all of which significantly exceeded the performance of randomly selected gene sets of equal size).

The performance of the finalized "S1500+" (1500 + nominated genes) gene set was evaluated on an independent external data set of Affymetrix human gene expression data. Pathway level enrichment scores and pathway-level Kolomogorov-Smirnov (KS) test significance p-values were computed to identify significantly perturbed pathways using both the Affymetrix measured values and the extrapolated values based on the Affymetix values for the S1500+ gene set from each microarray. The analysis showed that when using the S1500+ gene set, significant pathways were identified with a recall rate (defined as the percent of truly differentially expressed pathways that are correctly detected as differentially expressed) of 72% (13 out of 18) and a precision rate (defined as percent of detected differentially expressed pathways that are truly differentially expressed) of 72% (13 out of 18) but that when using a randomly selected gene set of the same size, the recall was similar at 77% (14 out of 18), but the precision was much lower at 33% (14 out of 43 pathways), with many more pathways incorrectly identified as enriched.

S1500+ Platform Annotation Files