GOBO

Information - Gene Set Analysis

Gene Set Analysis Tumors

Introduction

The GOBO Gene Set Analysis Tumors module allows a user to investigate the expression pattern of a single gene or a set of genes (both referred to as gene sets from here on) across a large set of breast cancers analyzed by Affymetrix U133A arrays. Furthermore, a user can investigate the association of a gene set with outcome by stratifying samples based on gene expression quantiles. The Gene Set Analysis Tumors module supports Kaplan-Meier analysis as well as univariate and multivariate analysis of the generated quantile classification groups. In addition, the module correlates the genes in a gene set with different co-expressed gene clusters emulating biological processes (Fredlund et al., Breast Cancer Research 2012;14(4):R113).

A gene set is defined as comprising either a single gene or multiple genes. By default, if a gene set comprises multiple genes, an average expression is calculated for the gene set. This average expression is then used for investigation across clinical and molecular groups, as well as for stratification of samples into groups based on gene expression quantiles. If a gene set is specified in file format (see the gene set file format below), gene weights may be associated with each gene. A gene weight is a numerical value (positive or negative) that is multiplied by a gene's expression value before the expression values of the genes in the gene set are summed (the weighted average is normalized by the sum of the absolute values of the weights). Consequently, a gene set may comprise genes with both positive and negative expected expression patterns.
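The weighted average described above can be sketched as follows. This is a minimal illustration, not the GOBO implementation: the function name `gene_set_score` and the dictionary-based input are assumptions made for the example, and the gene names are placeholders.

```python
def gene_set_score(expression, weights=None):
    """Weighted average expression of a gene set for one sample.

    expression: dict mapping gene identifier -> expression value.
    weights:    optional dict mapping gene identifier -> weight.
                Genes without a weight get weight 1, so an unweighted
                gene set reduces to the plain average.
    (Hypothetical helper for illustration only.)
    """
    if not expression:
        raise ValueError("gene set is empty")
    weights = weights or {}
    weighted_sum = 0.0
    abs_weight_sum = 0.0
    for gene, value in expression.items():
        w = weights.get(gene, 1.0)
        weighted_sum += w * value      # weight multiplied by expression
        abs_weight_sum += abs(w)       # normalization term
    if abs_weight_sum == 0:
        raise ValueError("all weights are zero")
    return weighted_sum / abs_weight_sum


# Unweighted: plain average of the two genes.
print(gene_set_score({"GeneA": 2.0, "GeneB": 4.0}))
# Weighted: GeneB is expected to be low when the signature is high.
print(gene_set_score({"GeneA": 2.0, "GeneB": 4.0}, {"GeneA": 1, "GeneB": -1}))
```

Dividing by the sum of absolute weights keeps the score on the same scale as a single gene's expression value, regardless of the sign or magnitude of the weights.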

A user may choose to perform the analysis in different subgroups of the combined Affymetrix breast cancer data set, e.g. all tumors, ER-positive tumors only or ER-negative tumors only. This selection of samples is referred to as the main data set from here on. During analysis, the association of a gene set with outcome is further investigated in sub stratifications of the main data set, including the molecular subtypes of breast cancer (Hu et al. 2006, Parker et al. 2009) as well as clinical subgroups. Thus, it is possible to investigate association with outcome in e.g. ER-positive tumors classified as Luminal A. However, gene centering is always relative to the full data set. Possible endpoints in outcome analysis include Overall Survival (OS), Distant Metastasis-Free Survival (DMFS) and Relapse-Free Survival (RFS).

Format of gene set files

Gene weights can only be used together with a gene set if a gene set file is used. A gene set file without gene weights should preferably be a tab-delimited text file in the following format (ID = header):

ID
GeneA
GeneB
...

Genes / probes in the ID column can be identified using either gene symbols, Entrez Gene Ids or Affymetrix Probe Ids. Gene symbols should follow the guidelines for human gene nomenclature and are generally in upper case (Guidelines for Human Gene Nomenclature).

A gene set file including gene weights should preferably be a tab-delimited text file in the following format (ID and weights = headers):

ID	weights
GeneA	Numerical value
GeneB	Numerical value
...	...
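As an illustration of the two file layouts, the following sketch reads a tab-delimited gene set with or without a weights column. The function name `read_gene_set` is hypothetical, the gene symbols are example values only, and the assumption that a missing weight appears as an empty cell is ours; the default weight of 1 follows the rule stated under Special notes.

```python
import csv
import io


def read_gene_set(text):
    """Parse tab-delimited gene set text into {identifier: weight}.

    The first non-empty row is treated as the header (ID, optionally
    followed by weights). Missing or absent weights default to 1.
    (Hypothetical helper; assumes missing weights are empty cells.)
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if row and row[0].strip()]
    if not rows:
        raise ValueError("gene set file is empty")
    header, body = rows[0], rows[1:]
    has_weights = len(header) > 1
    gene_set = {}
    for row in body:
        gene = row[0].strip()
        raw = row[1].strip() if has_weights and len(row) > 1 else ""
        gene_set[gene] = float(raw) if raw else 1.0
    return gene_set


example = "ID\tweights\nESR1\t1.5\nMKI67\t-1\nGATA3\t\n"
print(read_gene_set(example))
```

To read from disk instead, pass the file contents, e.g. `read_gene_set(open(path).read())`, where `path` is the location of your own gene set file.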

Special notes

  • If missing values are detected in a gene weight column these values are replaced with value = 1.
  • Kaplan-Meier analysis is only performed for classification groups that have at least 10 cases with clinical follow-up. If a group contains fewer samples it is omitted. This criterion is calculated for each sub stratification of the main data set. Thus, a group may be present in one subset of samples but missing in another.
  • When viewing Kaplan-Meier plots, multivariate analysis etc., group sizes may vary. The reason is that not all samples have clinical follow-up or clinical information for all covariates in multivariate analysis. Clinical variables, such as ER-status, are not imputed from gene expression data.
  • Specified covariates in multivariate analysis may be omitted in certain comparisons. E.g. ER-status is omitted when analyzing ER-positive or ER-negative tumors only. Furthermore, a covariate may be omitted if it causes NA or Inf to be returned by the Cox analysis function in R.
  • For data set GSE2603, metastasis-free survival is interpreted as DMFS. For the Chin et al. data set, the distrec variable is interpreted as DMFS. DMFS is not available for data sets GSE3494 and GSE1456. In DMFS mixed with RFS, relapse-free survival is used for these two data sets and DMFS for the other data sets.
  • Classification groups are labeled according to the gene expression cut-off values for the corresponding quantile. E.g., [-2.908,-0.479) indicates that expression values in the group range from -2.908 (inclusive) to -0.479 (exclusive).
  • When gene symbols or Entrez Gene Ids are used as identifiers, a data set merged on Entrez Gene Id is used for the analysis. When Affymetrix Probe Ids are used as identifiers, a data set with probe level data is used. However, for consistency, the molecular subtypes and functional module activity of samples calculated from the Entrez Gene Id merged data are used throughout all analyses.
  • The text run summary lists which predictor genes matched the main data set.
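The quantile stratification, interval labels and minimum group size described in the notes above can be sketched as follows. This is an illustration under stated assumptions rather than the GOBO code: the function name `quantile_groups`, the `has_followup` flags, the choice of quantile estimator (`method="inclusive"`) and the closed upper bound on the last group are all our own choices.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_groups(scores, has_followup, n_groups, min_cases=10):
    """Split samples into quantile groups labeled by their cut-offs.

    scores:       gene set expression value per sample.
    has_followup: True where a sample has clinical follow-up.
    Groups with fewer than min_cases samples with follow-up are omitted.
    (Hypothetical helper; the quantile estimator is an assumption.)
    """
    cuts = quantiles(scores, n=n_groups, method="inclusive")
    edges = [min(scores)] + cuts + [max(scores)]
    members = [[] for _ in range(n_groups)]
    for idx, score in enumerate(scores):
        # A score equal to a cut-off falls in the upper group: [lo, hi).
        members[bisect_right(cuts, score)].append(idx)
    result = {}
    for g, idxs in enumerate(members):
        n_followup = sum(1 for i in idxs if has_followup[i])
        if n_followup < min_cases:
            continue  # too few cases for Kaplan-Meier analysis
        closing = "]" if g == n_groups - 1 else ")"
        label = f"[{edges[g]:.3f},{edges[g + 1]:.3f}{closing}"
        result[label] = idxs
    return result


scores = [float(i) for i in range(40)]
followup = [i < 25 for i in range(40)]  # only the first 25 have follow-up
print(list(quantile_groups(scores, followup, n_groups=2)))
```

In this example the upper group contains 20 samples but only 5 with follow-up, so it is dropped, which mirrors how a group may be present in one subset of samples but missing in another.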

Input variables required for Gene Set Analysis Tumors

  • Specify gene set either through file or screen upload. Gene symbols should follow the guidelines for human gene nomenclature and are generally in upper case (Guidelines for Human Gene Nomenclature).
    • Specify the location of the gene set file. If a gene set file is used it should be in a suitable format as specified above.
    • Specify gene set through screen upload. Enter one or more gene identifiers by typing directly into the text box. Delimit genes with a tab or newline character. Note that gene weights may only be used if a gene set is provided in file format.
    • Specify the type of gene identifiers used: Gene Symbol, Entrez Gene Id or Affymetrix Probe ID.
  • Tumor selection. Select which tumor subset should be used as the main data set for the analysis.
  • Select number of groups (Quantiles). Specify the number of quantiles that samples should be stratified into based on gene expression of the gene set. This number equals the number of classification groups generated.
  • Select censoring. Select whether the survival data should be censored (5 or 10 years follow-up) or whether the full follow-up specified in individual public data sets should be used.
  • Select endpoint. Select which endpoint should be used for outcome analysis. Please observe that not all data sets have follow-up information for all of Overall Survival (OS), Distant Metastasis-Free Survival (DMFS) and Relapse-Free Survival (RFS).
  • Select multivariate parameters. Select covariates for multivariate analysis. Grade stratification implies histological grade 1 or 2 versus grade 3. Age-stratification implies <=50 years or > 50 years. Tumor size stratification implies <=20mm or >20mm.
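The censoring and covariate stratification options above can be illustrated with a short sketch. It assumes that censoring at 5 or 10 years means that follow-up beyond the limit is truncated and the sample is treated as event-free at the limit; the source does not spell this out. The function names and the 0/1 event coding are hypothetical.

```python
def censor(time_years, event, limit_years=None):
    """Truncate follow-up at limit_years (e.g. 5 or 10).

    event is 1 if the endpoint occurred, 0 if censored.
    With limit_years=None the full follow-up is kept.
    (Assumed behavior: events after the limit count as censored.)
    """
    if limit_years is not None and time_years > limit_years:
        return limit_years, 0
    return time_years, event


def stratify_covariates(grade=None, age=None, size_mm=None):
    """Dichotomize covariates using the cut-offs listed above.

    Missing values stay None; they are not imputed.
    """
    return {
        "grade": None if grade is None else ("3" if grade == 3 else "1-2"),
        "age": None if age is None else ("<=50" if age <= 50 else ">50"),
        "size": None if size_mm is None else ("<=20mm" if size_mm <= 20 else ">20mm"),
    }


print(censor(12.3, 1, limit_years=10))   # event after 10 years is censored
print(censor(4.0, 1, limit_years=10))    # event within 10 years is kept
print(stratify_covariates(grade=2, age=50, size_mm=21))
```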

Current output from Gene Set Analysis Tumors

  • A PDF file showing expression of the gene set across certain clinical and molecular subgroups of the main data set. In addition, the correlation of the expression pattern of the individual genes in a gene set to eight different co-expressed gene modules emulating different biological processes is shown.
  • A tab-delimited text file with results from univariate analysis for all sub stratifications of the selected main data set.
  • A tab-delimited text file with logrank P-values for all sub stratifications of the selected main data set.
  • A PDF file showing a summary of the logrank P-values from Kaplan-Meier analysis. P-values are displayed as -log10(P) in a bar plot for all sub stratifications of the selected main data set. Additionally, similar logrank P-values are also shown for the same sub stratifications in each public data set included in the selected main data set.
  • A PDF file describing the result of the outcome analysis. This file shows for each sub stratification of the main data set a Kaplan-Meier plot, the distribution of classification groups across included public data sets and results from multivariate analysis if applicable.
  • A text run summary.



Gene Set Analysis Cell Lines

Introduction

The Gene Set Analysis Cell Lines module of GOBO allows a user to investigate the expression of gene sets across a panel of commonly used breast cancer cell lines (Neve et al. 2006). A gene set is defined in the same way as in the Gene Set Analysis Tumors module and is calculated in the same fashion, including gene weights if supplied (see Gene Set Analysis Tumors for further explanation).

Format of gene set files

Gene weights can only be used together with a gene set if a gene set file is used. A gene set file with or without gene weights should have the same format as for Gene Set Analysis Tumors (see above).

Special notes

  • If missing values are detected in a gene weight column these values are replaced with the value 1.
  • The text run summary lists which predictor genes matched the main data set.

Input variables required for Gene Set Analysis Cell Lines

  • Specify gene set either through file or screen upload as for Gene Set Analysis Tumors.

Current Output from Gene Set Analysis Cell Lines

  • A PDF file showing the expression of the gene set across the breast cancer cell line panel.
  • A text run summary.