National Toxicology Program

S1500 Gene Set Strategy

Consensus Strategy for Selection of the Human S1500 Gene Set

To select a set of human sentinel genes (also referred to as the human S1500 gene set), the Tox21 Working Group downloaded 3339 gene expression series from the Gene Expression Omnibus (GEO) that used the Affymetrix HG-U133plus2 microarray platform. These data were manually curated by pairing control samples with their corresponding non-control samples, using the experimental design and other relevant information provided in the GEO database and/or associated manuscripts. The manually curated “.CEL” samples were processed using RMA normalization, and probe-level log2 fold changes were summarized using a median polish approach to compute gene-level log2 fold change values.
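The median polish summarization can be illustrated with a minimal sketch. It assumes a small matrix of probe-level log2 fold changes for one gene (rows are probes, columns are control/treated pairs) and takes the overall effect plus the column effect as the gene-level value. The function name, iteration limit, tolerance, and toy numbers are illustrative assumptions, not details of the Working Group's pipeline.

```python
import numpy as np

def median_polish(M, n_iter=10, tol=1e-6):
    """Tukey median polish of M (probes x sample pairs).

    Returns (overall, row_effects, col_effects, residuals) such that
    M[i, j] == overall + row_effects[i] + col_effects[j] + residuals[i, j].
    """
    R = np.array(M, dtype=float)
    overall = 0.0
    row = np.zeros(R.shape[0])
    col = np.zeros(R.shape[1])
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        rmed = np.median(R, axis=1)
        R -= rmed[:, None]
        row += rmed
        d = np.median(col)
        col -= d
        overall += d
        # Sweep out column (sample pair) medians.
        cmed = np.median(R, axis=0)
        R -= cmed[None, :]
        col += cmed
        d = np.median(row)
        row -= d
        overall += d
        if np.abs(rmed).max() < tol and np.abs(cmed).max() < tol:
            break
    return overall, row, col, R

# Hypothetical probe set: 3 probes x 4 control/treated pairs.
probe_effect = np.array([-0.2, 0.0, 0.3])
pair_effect = np.array([1.0, -0.5, 2.0, 0.0])
probe_lfc = probe_effect[:, None] + pair_effect[None, :]

overall, row, col, resid = median_polish(probe_lfc)
gene_lfc = overall + col  # one gene-level log2 fold change per sample pair
print(gene_lfc)
```

Because medians are used at every sweep, a single aberrant probe has little influence on the gene-level value.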

The Tox21 Working Group developed and used a hybrid approach composed of five sequential modules to identify the set of genes that best represents biological diversity, addresses gene-gene co-expression relationships, and adequately represents known pathways. This hybrid approach balances data-driven and knowledge-based evidence while allowing the selected genes to be assessed for their ability to extrapolate whole-transcriptome changes, at both the individual gene level and the pathway level.

Hybrid Approach Modules

The first module computes a diversity importance score (DIS) for each gene as follows:

  1. Perform principal component analysis (PCA) on all experiments and cluster the experiments by k-means clustering on the primary principal components (PCs), defined as the smallest set of PCs that collectively retains a specified fraction of the variability.
  2. Within each cluster, perform PCA again and compute a cluster-level DIS for each gene, defined as the sum of squares of the gene’s loading coefficients on all primary PCs.
  3. Aggregate the cluster-level DIS values using the Tukey biweight mean to obtain one DIS per gene.
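The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the Working Group's implementation: the 90% variance threshold, the number of clusters, the plain Lloyd-style k-means, the one-step Tukey biweight with tuning constant 5, and all function names are choices made here for the example.

```python
import numpy as np

def primary_pcs(X, var_kept=0.9):
    """PCA of X (experiments x genes); keep the smallest PC set retaining var_kept."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(frac, var_kept)) + 1, len(s))
    return U[:, :k] * s[:k], Vt[:k].T  # scores (n, k), loadings (genes, k)

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean (tuning constants are assumptions)."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    mad = np.median(np.abs(x - m))
    u = (x - m) / (c * mad + eps)
    w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)
    return float((w * x).sum() / w.sum())

def diversity_importance(X, n_clusters=3, var_kept=0.9):
    # Step 1: cluster experiments on the primary PCs of the full data.
    scores, _ = primary_pcs(X, var_kept)
    labels = kmeans(scores, n_clusters)
    # Step 2: per-cluster PCA; DIS = sum of squared loadings per gene.
    per_cluster = []
    for j in np.unique(labels):
        Xj = X[labels == j]
        if len(Xj) < 3:  # too few experiments for a meaningful PCA
            continue
        _, loadings = primary_pcs(Xj, var_kept)
        per_cluster.append((loadings ** 2).sum(axis=1))
    per_cluster = np.array(per_cluster)  # (clusters, genes)
    # Step 3: robust aggregation across clusters.
    return np.array([tukey_biweight(per_cluster[:, g]) for g in range(X.shape[1])])

# Toy data: 60 experiments in three groups, 20 genes.
rng = np.random.default_rng(0)
group_means = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([mu + rng.normal(size=(20, 20)) for mu in group_means])
dis = diversity_importance(X, n_clusters=3)
print(dis.round(3))
```

Because the loading vectors are orthonormal, each cluster-level DIS lies between 0 and 1, so a gene scores highly when it contributes strongly to the retained components within a cluster.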

The second module computes a co-expression importance score (CIS) for each gene using an n-fold, cross-validation-like approach in which all experiments are partitioned into mutually exclusive and collectively exhaustive folds of equal size. For each fold, the hierarchical clustering dendrogram (generated from a Spearman correlation matrix computed after leaving out the fold-specific experiments) is cut to obtain gene modules. Next, the fold-level CIS for each gene is computed as the mean squared Pearson correlation using the fold-specific experiments only. Finally, the fold-level CIS values for each gene are aggregated using the Tukey biweight mean to give that gene’s CIS.
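A rough sketch of this module is given below. Several details are assumptions made for the example because the description above does not fix them: the dendrogram is built by naive average linkage on the distance 1 minus the Spearman correlation and cut at a fixed number of modules, and the fold-level CIS of a gene is taken as its mean squared Pearson correlation with the other genes in its module. Function names and toy data are hypothetical.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean (tuning constants are assumptions)."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    u = (x - m) / (c * np.median(np.abs(x - m)) + eps)
    w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)
    return float((w * x).sum() / w.sum())

def spearman_matrix(X):
    """Spearman correlation between genes (columns); ties are not averaged."""
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def average_linkage_modules(dist, n_modules):
    """Naive average-linkage clustering, stopped at n_modules clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_modules:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for m, members in enumerate(clusters):
        labels[members] = m
    return labels

def coexpression_importance(X, n_folds=4, n_modules=3, seed=0):
    n, g = X.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    fold_cis = np.zeros((n_folds, g))
    for f, held in enumerate(folds):
        # Modules come from the experiments outside the fold.
        rest = np.setdiff1d(np.arange(n), held)
        labels = average_linkage_modules(1 - spearman_matrix(X[rest]), n_modules)
        # Scores come from the fold-specific experiments only.
        r2 = np.corrcoef(X[held], rowvar=False) ** 2
        for i in range(g):
            mates = np.flatnonzero((labels == labels[i]) & (np.arange(g) != i))
            if mates.size:  # singleton modules keep a score of 0
                fold_cis[f, i] = r2[i, mates].mean()
    return np.array([tukey_biweight(fold_cis[:, i]) for i in range(g)])

# Toy data: genes 0-2 share one signal, genes 3-4 another, gene 5 is noise.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 40))
noise = lambda: 0.1 * rng.normal(size=40)
X = np.column_stack([a + noise(), a + noise(), a + noise(),
                     b + noise(), b + noise(), rng.normal(size=40)])
cis = coexpression_importance(X, n_folds=4, n_modules=3)
print(cis.round(3))
```

Separating the experiments used to define modules from those used to score them is what makes the procedure cross-validation-like: a gene scores highly only if its module membership generalizes to held-out experiments.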

A composite importance score for each gene is computed by taking the mean of the squared ranks of its CIS and DIS.
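As a small worked example, the composite score can be computed as below. The ranking direction is an assumption made here: rank 1 goes to the lowest score, so that a larger composite value indicates a more important gene. The function name and input values are illustrative.

```python
import numpy as np

def composite_score(cis, dis):
    """Mean of the squared ranks of CIS and DIS (rank 1 = lowest score)."""
    def ranks(v):
        order = np.argsort(v, kind="stable")
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    return (ranks(np.asarray(cis)) ** 2 + ranks(np.asarray(dis)) ** 2) / 2

# Three hypothetical genes.
composite = composite_score(cis=[0.1, 0.5, 0.9], dis=[0.3, 0.2, 0.8])
print(composite)
```

Squaring the ranks before averaging rewards a gene that ranks very highly on one score more than a gene that is middling on both.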

The third module utilizes pathway annotations from the Molecular Signatures Database (MSigDB v4.0). The gene selection is updated iteratively by identifying the best inclusion and best exclusion gene candidates such that all pathways (C2:CP) are represented by at least three genes.
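One way such an iterative update could work is sketched below. The specific inclusion and exclusion criteria are assumptions for illustration only: the best inclusion candidate is taken to be the highest-scoring unselected gene in any under-covered pathway, and the best exclusion candidate the lowest-scoring selected gene whose removal leaves no pathway short. The sketch also assumes every pathway gene has a score; gene and pathway names are hypothetical.

```python
def enforce_pathway_coverage(scores, pathways, n_select, min_genes=3):
    """Adjust a top-n selection until each pathway has min_genes selected genes.

    scores:   dict gene -> composite importance score
    pathways: dict pathway name -> set of member genes
    """
    ranked = sorted(scores, key=lambda g: (scores[g], g), reverse=True)
    selected = set(ranked[:n_select])

    def surplus(p):
        # Selected members minus the requirement (capped at pathway size).
        required = min(min_genes, len(pathways[p]))
        return len(pathways[p] & selected) - required

    while True:
        short = [p for p in pathways if surplus(p) < 0]
        if not short:
            return selected
        # Best inclusion: top-scoring unselected gene from a short pathway.
        candidates = {g for p in short for g in pathways[p]} - selected
        gene_in = max(candidates, key=lambda g: (scores[g], g))
        selected.add(gene_in)
        # Best exclusion: lowest-scoring gene that no pathway depends on.
        removable = [
            g for g in selected
            if g != gene_in
            and all(surplus(p) > 0 for p in pathways if g in pathways[p])
        ]
        if removable:  # otherwise the selection grows by one gene
            selected.remove(min(removable, key=lambda g: (scores[g], g)))

# Hypothetical scores and pathways.
scores = {"g1": 10, "g2": 9, "g3": 8, "g4": 7, "g5": 3, "g6": 2, "g7": 1}
pathways = {"A": {"g1", "g2", "g3", "g4"}, "B": {"g5", "g6", "g7"}}
final = enforce_pathway_coverage(scores, pathways, n_select=4)
print(sorted(final))
```

In this sketch the selection size is held constant only while a gene can be dropped without leaving a pathway under-represented; otherwise the set is allowed to grow.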

The fourth module, the extrapolation module, computes a transformation matrix that can be used to extrapolate the expression fold changes of unselected genes from those of the selected genes. This matrix is derived by principal component regression, in which the primary principal components constructed from the selected-gene data matrix are used as regressors. Like regularized regression, this approach avoids overfitting, but it is computationally less intensive when working with high-dimensional data.
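Principal component regression of this kind can be sketched in a few lines. The sketch assumes mean-centered fold changes, a variance threshold for choosing the primary PCs, and ordinary least squares on the PC scores. Function names, the threshold, and the simulated data are illustrative assumptions.

```python
import numpy as np

def fit_pcr(S, U, var_kept=0.9):
    """Fit U ~ S by principal component regression.

    S: (experiments, selected genes), U: (experiments, unselected genes).
    Returns (T, s_mean, u_mean) with U approximately (S - s_mean) @ T + u_mean.
    """
    s_mean, u_mean = S.mean(axis=0), U.mean(axis=0)
    Sc, Uc = S - s_mean, U - u_mean
    _, sv, Vt = np.linalg.svd(Sc, full_matrices=False)
    frac = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    k = min(int(np.searchsorted(frac, var_kept)) + 1, len(sv))
    V = Vt[:k].T                                # loadings of the primary PCs
    Z = Sc @ V                                  # PC scores used as regressors
    B, *_ = np.linalg.lstsq(Z, Uc, rcond=None)  # regress unselected genes on scores
    return V @ B, s_mean, u_mean                # transformation matrix T

def extrapolate(S_new, T, s_mean, u_mean):
    """Predict unselected-gene fold changes from selected-gene fold changes."""
    return (S_new - s_mean) @ T + u_mean

# Simulated data driven by three latent factors.
rng = np.random.default_rng(2)
F = rng.normal(size=(60, 3))
A = rng.normal(size=(3, 10))
B_true = rng.normal(size=(3, 5))
S = F @ A + 0.01 * rng.normal(size=(60, 10))
U = F @ B_true

T, s_mean, u_mean = fit_pcr(S[:50], U[:50], var_kept=0.999)
U_hat = extrapolate(S[50:], T, s_mean, u_mean)
print(np.corrcoef(U_hat.ravel(), U[50:].ravel())[0, 1])
```

Restricting the regression to a few leading components is what limits overfitting: the number of fitted coefficients per unselected gene equals the number of primary PCs rather than the number of selected genes.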

Lastly, in the performance evaluation module, the Tox21 Working Group computed several gene-level and pathway-level metrics to assess the extrapolation capability of a given selection.

This method uses publicly available human gene expression data and computes a score for every gene that reflects the gene’s importance in representing transcriptional diversity, the correlation of its expression with that of other genes, and known pathway annotation.

The Tox21 Working Group performed 20-fold cross-validation and tested performance using a variety of metrics, including Pearson correlation (gene expression fold change values), concordance rate (pathway activity calls), and significance overlap (top differentially expressed genes), comparing the actual microarray data from the test set with the extrapolated data generated by our method. Results indicate that our method can select 1500 genes with full pathway coverage and can predict the gene expression of the rest of the transcriptome with high accuracy (average Pearson correlation of 0.79, concordance rate of 0.84, and significance overlap of 0.5, all of which significantly exceeded the performance of randomly selected gene sets of equal size). The performance was evaluated on large Affymetrix data sets of rat gene expression, and the approach was then executed on the entire set of curated human Affymetrix GEO data sets to select a list of 1500 (S1500) genes. Nominated genes were evaluated and included where they enhanced performance. The performance of the finalized “S1500” (1500 + nominated genes) gene set will be additionally evaluated on an independent external data set of human gene expression data.
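The three metrics can be illustrated with simple definitions. These are assumptions for the example, since the exact definitions are not given above: concordance rate is taken as the fraction of pathways with matching activity calls, and significance overlap as the shared fraction of the top-n genes ranked by absolute fold change. All values are made up.

```python
import numpy as np

def pearson(actual, predicted):
    """Pearson correlation between measured and extrapolated fold changes."""
    return float(np.corrcoef(actual, predicted)[0, 1])

def concordance_rate(actual_calls, predicted_calls):
    """Fraction of pathways whose activity call agrees."""
    return float(np.mean(np.asarray(actual_calls) == np.asarray(predicted_calls)))

def significance_overlap(actual, predicted, top_n):
    """Shared fraction of the top_n genes by absolute fold change."""
    top_actual = set(np.argsort(-np.abs(actual))[:top_n].tolist())
    top_pred = set(np.argsort(-np.abs(predicted))[:top_n].tolist())
    return len(top_actual & top_pred) / top_n

# Hypothetical measured and extrapolated log2 fold changes for five genes.
actual = np.array([2.0, -1.5, 0.1, 0.0, 1.0])
predicted = np.array([1.8, -1.0, 0.3, -0.1, 0.2])

r = pearson(actual, predicted)
conc = concordance_rate([1, -1, 0, 0], [1, 0, 0, 0])  # up, down, or unchanged calls
overlap = significance_overlap(actual, predicted, top_n=3)
print(r, conc, overlap)
```

Comparing each metric against the same metric for randomly chosen gene sets of equal size, as described above, gives a baseline for judging whether a selection is informative.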

NTP is located at the National Institute of Environmental Health Sciences, part of the National Institutes of Health.