GSEAPreranked (v4)

Runs the gene set enrichment analysis against a user-supplied ranked list of genes.

Author: Chet Birger, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Introduction

GSEAPreranked runs Gene Set Enrichment Analysis (GSEA) against a user-supplied, ranked list of genes.  It determines whether a priori defined sets of genes show statistically significant enrichment at either end of the ranking.  A statistically significant enrichment indicates that the biological activity (e.g., biomolecular pathway) characterized by the gene set is correlated with the user-supplied ranking.

Details

Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data.  It evaluates cumulative changes in the expression of groups of multiple genes defined based on prior biological knowledge. 

The GSEAPreranked module can be used to conduct gene set enrichment analysis on data that do not conform to the typical GSEA scenario. For example, it can be used when the ranking metric choices provided by the GSEA module are not appropriate for the data, or when a ranked list of genomic features deviates from traditional microarray expression data (e.g., GWAS results, ChIP-Seq, RNA-Seq, etc.).

The user provides GSEAPreranked with a pre-ranked gene list.  Paired with each gene in the list is the numeric ranking statistic, which GSEAPreranked uses to rank order genes in descending order. GSEAPreranked calculates an enrichment score for each gene set.  A gene set’s enrichment score reflects how often members of that gene set occur at the top or bottom of the ranked data set (for example, in expression data, in either the most highly expressed genes or the most underexpressed genes).
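As an illustration of the input described above, the following minimal sketch orders gene/score pairs from highest to lowest and writes them as two tab-separated columns. The two-column, tab-delimited layout is an assumption about the RNK format made for this example, and the function name and gene names are hypothetical.

```python
def format_rnk(scores):
    """Return ranked-list text: one 'gene<TAB>score' line per gene, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)


# Hypothetical gene names and ranking statistics, for illustration only.
example_scores = {"GENE_B": -1.2, "GENE_A": 2.5, "GENE_C": 0.4}
rnk_text = format_rnk(example_scores)
print(rnk_text, end="")
```

Genes with the largest positive statistic appear first and those with the most negative statistic appear last, so both ends of the ranking are meaningful.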

Avoid using GSEAPreranked to collapse your ranked list to gene symbols.

In order to calculate enrichment scores, GSEA needs to match genes from gene sets to those in your input ranked list. Typically, GSEA is run using gene sets from MSigDB, which consist of human gene symbols. If the input data contain other types of identifiers, such as Affymetrix probe set identifiers, they need to be converted to gene symbols to match the identifiers in MSigDB sets. The standard GSEA module provides the collapse dataset option to perform this conversion, which includes handling the case of several feature identifiers mapping to the same gene identifier. However, this option was developed and tuned with gene expression data in mind, whereas the numbers in a user-defined ranked list represent a metric that was computed by an unspecified ranking procedure outside of GSEA. Therefore, while GSEAPreranked also provides a collapse dataset option, we recommend that you provide a ranked list that already has unique human gene symbols and select false (the default value) for the collapse dataset parameter.
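One way to prepare such a list before running the module is sketched below. The rule used here (keep the score with the largest absolute value for each gene symbol) is only an assumption chosen for illustration; the appropriate rule depends on how your ranking statistic was computed. The function name, probe identifiers, and mapping are hypothetical.

```python
def collapse_to_symbols(feature_scores, feature_to_symbol):
    """Keep, for each gene symbol, the feature score with the largest absolute value.

    Features that have no symbol in the mapping are dropped.
    """
    collapsed = {}
    for feature, score in feature_scores.items():
        symbol = feature_to_symbol.get(feature)
        if symbol is None:
            continue  # no gene symbol known for this feature
        if symbol not in collapsed or abs(score) > abs(collapsed[symbol]):
            collapsed[symbol] = score
    return collapsed


# Hypothetical probe-level scores and probe-to-symbol mapping.
feature_scores = {"probe_1": 0.8, "probe_2": -2.1, "probe_3": 1.5, "probe_4": 0.3}
feature_to_symbol = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
collapsed = collapse_to_symbols(feature_scores, feature_to_symbol)
print(collapsed)
```

The result has one entry per gene symbol, so it can be written out as a ranked list and submitted with collapse dataset set to false.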

The ranked list must not contain duplicate ranking values.

Duplicate ranking values may lead to arbitrary ordering of genes and to erroneous results.  Therefore, it is important to make sure that the ranked list contains no duplicate ranking values.
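A quick check for tied values can be run before submitting the list. The sketch below groups genes by ranking value and reports any value shared by two or more genes; the function name and example data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_scores(ranked):
    """Map each ranking value shared by two or more genes to the genes that share it."""
    genes_by_score = defaultdict(list)
    for gene, score in ranked:
        genes_by_score[score].append(gene)
    return {score: genes for score, genes in genes_by_score.items() if len(genes) > 1}


# Hypothetical ranked list in which two genes share the value 1.0.
ranked = [("GENE_A", 2.5), ("GENE_B", 1.0), ("GENE_C", 1.0), ("GENE_D", -0.7)]
duplicates = find_duplicate_scores(ranked)
print(duplicates)
```

An empty result means every gene has a distinct ranking value. How to resolve any ties that are found is left to the procedure that produced the ranking.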

Permutation test

In GSEAPreranked, permutations are always done by gene set. In standard GSEA, you can choose to set the parameter Permutation type to phenotype (the default) or gene set, but GSEAPreranked does not provide this option.
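The idea behind gene set permutation can be sketched as follows: the observed statistic for a gene set is compared with the statistics of randomly drawn gene sets of the same size. This is a simplified, one-sided illustration, not the module's implementation. GSEA uses the enrichment score as its statistic and treats positive and negative scores separately; here the statistic is passed in as a function, and the example uses a stand-in (the sum of member scores). All names and data are hypothetical.

```python
import random


def gene_set_permutation_p(ranked, gene_set, statistic, n_perm=1000, seed=0):
    """Return (observed statistic, fraction of random same-size sets scoring at least as high)."""
    rng = random.Random(seed)  # fixed seed gives repeatable results
    genes = [gene for gene, _ in ranked]
    members = {gene for gene in genes if gene in gene_set}
    observed = statistic(ranked, members)
    at_least_as_high = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(genes, len(members)))
        if statistic(ranked, random_set) >= observed:
            at_least_as_high += 1
    return observed, at_least_as_high / n_perm


def sum_of_member_scores(ranked, gene_set):
    """Stand-in statistic for this example only (not the GSEA enrichment score)."""
    return sum(score for gene, score in ranked if gene in gene_set)


# Hypothetical ranked list of 20 genes with scores 10, 9, ..., -9.
ranked = [(f"G{i}", float(10 - i)) for i in range(20)]
observed, p_value = gene_set_permutation_p(
    ranked, {"G0", "G1", "G2"}, sum_of_member_scores, n_perm=200, seed=1
)
print(observed, p_value)
```

Because the null distribution is built from random gene sets rather than from shuffled phenotype labels, it needs only the ranked list, which is why this is the only permutation type available for pre-ranked input.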

Understand and keep in mind how GSEAPreranked computes enrichment scores.

The GSEA PNAS 2005 paper introduced a method where a running sum statistic is incremented by the absolute value of the ranking metric when a gene belongs to the set. This method has proven to be efficient and facilitates intuitive interpretation of ranking metrics that reflect correlation of gene expression with phenotype. In the case of GSEAPreranked, you should make sure that this weighted scoring scheme applies to your choice of ranking statistic. If in doubt, we recommend the more conservative approach of setting the scoring scheme parameter to classic. Note, however, that the parameter’s default value is weighted, the same default used by the GSEA module.  Please refer to the GSEA PNAS 2005 paper for further details.
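The difference between the classic and weighted schemes can be seen in a minimal sketch of the running sum, written from the description in the 2005 paper. It is an illustration under simplifying assumptions, not the module's code. Walking down the ranked list, the sum rises at each gene in the set by that gene's share of the total |score|^p over all set members, and falls by a constant amount at each gene outside the set. The enrichment score is the largest deviation from zero. The function name and example data are hypothetical.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Signed maximum deviation from zero of the running sum over a ranked list.

    ranked: (gene, score) pairs sorted by score, highest first.
    p: exponent applied to |score|; p=0 gives every set member the same step size.
    """
    hit_weights = [abs(score) ** p if gene in gene_set else 0.0 for gene, score in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted set member and one non-member")

    running = 0.0
    best = 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in gene_set:
            running += weight / total_hit  # step up at a set member
        else:
            running -= 1.0 / n_miss  # step down at a non-member
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list; the set has one member at the top and one mid-list.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
classic_es = enrichment_score(ranked, {"A", "D"}, p=0)
weighted_es = enrichment_score(ranked, {"A", "D"}, p=1)
print(classic_es, weighted_es)
```

In this toy example the weighted score is higher than the classic one because the set member at the top of the list carries a large ranking value. This is why the weighted scheme is appropriate only when the magnitude of your ranking statistic is meaningful.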

References

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545-15550.

Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267-273.

GSEA User Guide: http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

Parameters

NOTE: Certain parameters are considered to be "advanced"; that is, they control details of the GSEAPreranked algorithm that are typically not changed. You should not override the default values unless you are conversant with the algorithm.  These parameters are marked "Advanced" in the parameter descriptions.

Name Description
ranked list * This is a file in RNK format that contains the rank ordered gene (or feature) list.
gene sets database

This drop-down allows you to select gene sets from the Molecular Signatures Database (MSigDB) on the GSEA website.  This provides access to only the most recent versions of MSigDB. 

If you want to use files from earlier versions of MSigDB, you will need to download that file from the archived releases on the website and specify it in the gene sets database file parameter.

If you do not select an option here, you MUST upload a file in the gene sets database file parameter.
gene sets database file Allows you to upload a gene set file not available in the current version of MSigDB (and thus not listed in the gene sets database parameter drop-down).  This file must be in GMT, GMX, or GRP format. 
number of permutations * Specifies the number of permutations to perform in assessing the statistical significance of the enrichment score. It is best to start with a small number, such as 10, to check that your analysis will complete successfully (e.g., that your gene sets satisfy the minimum and maximum size requirements and that collapsing genes to symbols works correctly). After the analysis completes successfully, run it again with a full set of permutations. The recommended number of permutations is 1000. Default: 1000
collapse dataset *

Select true to have GSEAPreranked collapse each feature in the ranked list into a single line of data for the gene, which is identified by its HUGO gene symbol. Be sure that your gene sets and array annotations also use gene symbols as the gene identifier format. 

Select false (RECOMMENDED) to use your ranked list as is, with its native feature identifiers. When you select this option, the chip annotation file (chip platform parameter) is optional and you must specify a gene set file (gene sets database file parameter) that identifies genes using the same feature (gene or probe) identifiers as are used in your ranked list.

Default: false
chip platform

This drop-down allows you to specify the chip annotation file, which lists each probe on a chip and its matching HUGO gene symbol, used for the expression array.  The chip files listed here are from the GSEA website: http://www.broadinstitute.org/gsea/downloads.jsp

If you used a chip file not listed here, you will need to upload and specify it in the chip platform file parameter.

This parameter is not required if collapse dataset is false.
chip platform file Chip to use. Upload a chip file if your chip is not listed as a choice for the chip platform parameter.
scoring scheme *

The enrichment statistic.  This parameter affects the running-sum statistic used for the enrichment analysis, controlling the value of p used in the enrichment score calculation.  Options are:

  • classic: p=0
  • weighted (default): p=1; a running sum statistic that is incremented by the absolute value of the ranking metric when a gene belongs to the set (see the 2005 PNAS paper for details)
  • weighted_p2: p=2
  • weighted_p1.5: p=1.5
max gene set size * After filtering from the gene sets any gene not in the ranked list, gene sets larger than this are excluded from the analysis. Default: 500
min gene set size * After filtering from the gene sets any gene not in the ranked list, gene sets smaller than this are excluded from the analysis. Default: 15
normalization mode *

Method used to normalize the enrichment scores across analyzed gene sets. Options are:

  • meandiv (default): GSEA normalizes the enrichment scores as described in Normalized Enrichment Score (NES) in the GSEA User Guide.
  • None: GSEA does not normalize the enrichment scores.
collapsing mode

Collapsing mode for sets of multiple probes for a single gene. Used only when the collapse dataset parameter is set to true. Select the expression values to use for the single probe that will represent all probe sets for the gene. Options are:

  • Max_probe (default): For each sample, use the maximum expression value for the probe set.  That is, if there are three probes that map to a single gene, the expression value that will represent the collapsed probe set will be the maximum expression value from those three probes.
  • Median_of_probes: For each sample, use the median expression value for the probe set.
omit features with no symbol match Used only when collapse dataset is set to true. By default (true), the new dataset excludes probes/genes that have no gene symbols. Set to false to have the new dataset contain all probes/genes that were in the original dataset. 
make detailed gene set report * Create detailed gene set report (heat map, mountain plot, etc.) for each enriched gene set. Default: true
num top sets * GSEAPreranked generates summary plots and detailed analysis results for the top x gene sets in each phenotype, where x is 20 by default. The top gene sets are those with the largest normalized enrichment scores. Default: 20
random seed * Seed used to generate random numbers for the gene set permutations. Timestamp is the default. Using a specific integer-valued seed generates consistent results, which is useful when testing software.
output file name * Name of the output file. The name cannot include spaces. Default: <expression.dataset_basename>.zip

* - required

Input Files

1. ranked list:  RNK file

This file contains the rank ordered gene (or feature) list.

2. gene sets database file: GMT, GMX, or GRP file (required only if you do not select a gene sets database from the drop-down)

A gene set file not available in the current version of MSigDB (and thus not listed in the gene sets database parameter drop-down).

3. chip platform file: CHIP file (optional, if you do not select a chip platform from the drop-down)

A chip annotation file not available in the module drop-down list.
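To show how an uploaded gene set file interacts with the min gene set size and max gene set size parameters, the sketch below reads gene sets from GMT-style text and applies the size filter after removing genes that are absent from the ranked list. It assumes the common GMT layout (set name, description, then gene identifiers, all tab-separated) and is an illustration rather than the module's own parser. The function names and example data are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-style text: each line is name<TAB>description<TAB>gene1<TAB>gene2..."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        gene_sets[fields[0]] = {gene for gene in fields[2:] if gene}
    return gene_sets


def filter_by_size(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Restrict each set to genes in the ranked list, then keep sets within the size limits."""
    kept = {}
    for name, genes in gene_sets.items():
        present = genes & ranked_genes
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Hypothetical data; the small limits are used only to keep the example short.
gmt_text = "SET_ONE\tdescription\tA\tB\tC\nSET_TWO\tdescription\tA\tZ\n"
ranked_genes = {"A", "B", "C", "D"}
kept = filter_by_size(parse_gmt(gmt_text), ranked_genes, min_size=2, max_size=3)
print(sorted(kept))
```

SET_TWO is dropped because only one of its genes appears in the ranked list, which puts it below the minimum size. Running a check like this first can explain why a gene set is missing from the results.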

Output Files

1. ZIP file containing the result files

For more information on interpreting these results, see Interpreting GSEA Results in the GSEA User Guide.

Platform Dependencies

Task Type:
Gene List Selection

CPU Type:
any

Operating System:
any

Language:
Java

Version Comments

Version Release Date Description
4 2016-02-04 Updated to give access to MSigDB v5.1
3 2015-12-04 Updating the GSEA jar to deal with an issue with FTP access. Fixes an issue for GP@IU.
2 2015-06-16 Updated for MSigDB v5.0 and hallmark gene sets support.
1 2013-06-17 Initial Release