GOBO

Information - Sample Prediction

Introduction

The GOBO Sample Prediction module allows a user to test classifiers in certain predefined forms for association with outcome in the combined breast tumor data set. Three types of classifiers are currently supported: Centroid-based, PAM-based, and Quantile-based. The classifiers must be submitted in a file format suitable for the GOBO application, as specified below. The Sample Prediction module supports Kaplan-Meier analysis as well as univariate and multivariate analysis of classification groups. In addition, the correlation of predictor genes to different co-expressed gene clusters emulating biological processes is computed (Fredlund et al., Breast Cancer Research 2012;14(4):R113).

A user may choose to perform the analysis in different subgroups of the full data set, e.g. all tumors, ER-positive tumors only, or ER-negative tumors only. This selection is referred to as the main data set. During analysis, the performance of a classifier is further investigated in sub stratifications of the main data set, including the molecular subtypes of breast cancer (Hu et al. 2006, Parker et al. 2009) and clinical variables. Possible endpoints in outcome analysis include Overall Survival (OS), Distant Metastasis-Free Survival (DMFS) and Recurrence-Free Survival (RFS).

Format of predictor files

A predictor file for PAM or Quantile classification should be a file with a single column listing gene / probe identifiers. The column header should read ID.

ID
GeneA
GeneB
...

A predictor file for centroid classification should be a tab-delimited text file with the following format:

ID       Centroid_A       Centroid_B       ...
GeneA    Numeric value    Numeric value    ...
GeneB    Numeric value    Numeric value    ...
...      ...              ...              ...

Gene / probe identifiers should be gene symbols, Entrez Gene Ids or Affymetrix Probe Ids. Gene symbols should follow the guidelines for human gene nomenclature and are generally in upper case (Guidelines for Human Gene Nomenclature).

Names for the classification groups are taken from the column headers of the predictor file for centroid classification. In the example above, the classification groups would be named Centroid_A and Centroid_B.
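
As an illustration of the centroid file layout described above, the following minimal Python sketch reads a tab-delimited centroid predictor file and takes the group names from the header. It is not part of GOBO; the function name read_centroid_file and the example gene values are hypothetical.

```python
import csv
import io

# Hypothetical example content; the numeric values are made up.
EXAMPLE = (
    "ID\tCentroid_A\tCentroid_B\n"
    "GeneA\t0.52\t-0.31\n"
    "GeneB\t-1.10\t0.84\n"
)


def read_centroid_file(handle):
    """Parse a tab-delimited centroid predictor file.

    Returns (group_names, centroids), where centroids maps each group
    name to a {gene_id: value} dictionary.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 2 or header[0] != "ID":
        raise ValueError("expected an 'ID' column followed by at least one centroid column")
    group_names = header[1:]
    centroids = {name: {} for name in group_names}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r} has {len(row)} fields, expected {len(header)}")
        for name, value in zip(group_names, row[1:]):
            centroids[name][row[0]] = float(value)
    return group_names, centroids


group_names, centroids = read_centroid_file(io.StringIO(EXAMPLE))
print(group_names)
```

With the example content, the group names come out as Centroid_A and Centroid_B, matching the header row.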

Special notes

  • Gene weights cannot be used in the Sample Prediction module.
  • Kaplan-Meier analysis is only performed for classification groups that have at least 10 cases with clinical follow-up. If a group contains fewer samples it is omitted. This criterion is calculated for each sub stratification of the main data set. Thus, a group may be present in one subset of samples but missing in another.
  • When viewing Kaplan-Meier plots, multivariate analysis etc., group sizes may vary. The reason is that not all samples have clinical follow-up or clinical information about all covariates. Clinical variables, such as ER-status, are not imputed from gene expression data.
  • Specified covariates in multivariate analysis may be omitted in certain comparisons. E.g. ER-status is omitted when analyzing ER-positive or ER-negative tumors only. Furthermore, a covariate may be omitted if it causes NA or Inf to be returned by the Cox analysis function in R.
  • If the classification method involves correlation to gene expression centroids, the following applies. If only one centroid is provided, samples with correlation > cut-off are assigned to the centroid, and samples with correlation < cut-off are assigned to a "non_centroid" group. If multiple centroids are provided, samples with correlation < cut-off to every centroid are assigned to an "unclassified" group and omitted from further analysis.
  • If the classification method involves correlation to gene expression centroids, the predictor file must have at least two columns (the ID column and at least one centroid column); otherwise no analysis is performed.
  • For data set GSE2603, metastasis-free survival is interpreted as DMFS. For the Chin et al. data set, the distrec variable is interpreted as DMFS. DMFS is not available for data sets GSE3494 and GSE1456. For the endpoint DMFS mixed with RFS, relapse-free survival is used for these two data sets and DMFS for the other data sets.
  • When gene symbols or Entrez Gene Ids are used as identifiers, a data set merged on Entrez Gene Id is used for the analysis. When Affymetrix Probe Ids are used as identifiers, a data set with probe-level data is used. However, for consistency, the molecular subtypes and functional module activity of samples calculated from the Entrez Gene Id merged data are used throughout all analyses.
  • The text run summary lists which predictor genes matched the main data set.
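
The centroid assignment rule in the notes above can be sketched in Python as follows. This is a minimal illustration, not GOBO's implementation. Two points are assumptions not stated in the notes: with multiple centroids, a sample is assigned to the centroid with the highest correlation, and a correlation exactly equal to the cut-off is treated as below it. The function names pearson and assign_sample are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def assign_sample(sample, centroids, cutoff):
    """Assign one sample to a classification group.

    sample: expression values for the matched predictor genes.
    centroids: {group_name: centroid values in the same gene order}.
    """
    correlations = {name: pearson(sample, values) for name, values in centroids.items()}
    # Assumption: the best-correlated centroid wins when several are provided.
    best = max(correlations, key=correlations.get)
    if correlations[best] > cutoff:
        return best
    # Below the cut-off: the group name depends on the number of centroids.
    return "non_centroid" if len(centroids) == 1 else "unclassified"


# Hypothetical centroids and samples, for illustration only.
two_centroids = {
    "Centroid_A": [1.0, 2.0, 3.0, 4.0],
    "Centroid_B": [4.0, 3.0, 2.0, 1.0],
}
print(assign_sample([1.1, 1.9, 3.2, 3.8], two_centroids, cutoff=0.5))
print(assign_sample([2.0, 1.0, 2.0, 1.0], two_centroids, cutoff=0.5))
```

In this example the first sample follows Centroid_A closely and is assigned to it, while the second correlates only weakly with either centroid and falls into the "unclassified" group.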

Input variables required for Sample Prediction

  • Specify the location of the predictor file. The predictor file should be in a suitable format as specified above. Gene symbols should follow the guidelines for human gene nomenclature and are generally in upper case (Guidelines for Human Gene Nomenclature).
  • Specify the type of gene identifiers used: Gene Symbol, Entrez Gene Id or Affymetrix Probe ID.
  • Select classification method. Select the classification method that matches the predictor file.
  • Parameters for method Classification. These parameters only apply if the correlation classification method has been selected.
    • Select correlation method. Select whether to use Pearson or Spearman correlation.
    • Specify correlation cut-off. Specify the correlation cut-off to be used to assign samples to the different centroids. See special notes.
  • Parameters for methods PAM and Quantiles. These parameters only apply if the PAM or Quantile method has been selected.
    • Select number of groups. Specify the number of groups for PAM clustering or the number of equally sized quantiles for the Quantile method.
  • Tumor selection. Select the tumor subset to be used as the main data set for the classification.
  • Select censoring. Select whether the survival data should be censored (5 or 10 years follow-up) or whether the full follow-up specified in the public data sets should be used.
  • Select end-point. Select the endpoint to be used for survival analysis. Please note that not all data sets have follow-up information for all of Overall Survival (OS), Distant Metastasis-Free Survival (DMFS) and Recurrence-Free Survival (RFS).
  • Select multivariate parameters. Select covariates for multivariate analysis. Grade stratification implies histological grade 1 or 2 versus grade 3. Age stratification implies <=50 years or >50 years. Size stratification implies <=20mm or >20mm.
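
The censoring and covariate stratification options above can be illustrated with a short Python sketch. It is not GOBO's code. It assumes that censoring at 5 or 10 years means truncating follow-up at that time and treating any later event as censored, which is the usual convention but is not spelled out here. All function names and group labels are hypothetical.

```python
def censor_follow_up(time_years, event, max_years=None):
    """Censor follow-up at max_years; None keeps the full follow-up.

    event is 1 if the endpoint occurred and 0 if the sample was censored.
    Assumption: events after max_years are counted as censored at max_years.
    """
    if max_years is not None and time_years > max_years:
        return max_years, 0
    return time_years, event


def grade_group(grade):
    """Histological grade 1 or 2 versus grade 3."""
    if grade not in (1, 2, 3):
        raise ValueError(f"unexpected histological grade: {grade!r}")
    return "grade 1-2" if grade in (1, 2) else "grade 3"


def age_group(age_years):
    """Age <=50 years versus >50 years."""
    return "<=50" if age_years <= 50 else ">50"


def size_group(size_mm):
    """Tumor size <=20mm versus >20mm."""
    return "<=20mm" if size_mm <= 20 else ">20mm"


# A hypothetical sample: event at 12.3 years, analysed with 10-year censoring.
print(censor_follow_up(12.3, 1, max_years=10))
print(grade_group(2), age_group(50), size_group(21))
```

Under 10-year censoring, the sample with an event at 12.3 years contributes 10 years of event-free follow-up. Boundary values (50 years, 20mm) fall into the lower group, as given by the <= definitions above.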

Current Output from Sample Prediction

  • A tab-delimited text file with results from univariate analysis for sub stratifications of the selected main data set.
  • A tab-delimited text file with logrank P-values for sub stratifications of the selected main data set.
  • A tab-delimited text file with characteristics of the classification groups regarding e.g. clinical annotations and molecular subtypes (Hu-subtypes).
  • A PDF file with characteristics of the classification groups regarding e.g. clinical annotations and molecular subtypes (Hu-subtypes), including Fisher test p-values for the association between classification groups and other stratifications of the samples. This file is only generated when Sample Prediction is run on the full data set.
  • A PDF file showing a summary of the logrank P-values from Kaplan-Meier analysis. P-values are shown as -log10(P) (bars) for all sub stratifications of the selected main data set. Corresponding logrank P-values are also shown for the same sub stratifications in each public data set included in the selected main data set.
  • A PDF file describing the result of the outcome analysis. This file shows for each sub stratification of the main data set a Kaplan-Meier plot, the distribution of classification groups across included public data sets and results from multivariate analysis if selected.
  • A PDF file showing the correlation of matched predictor genes to eight different co-expressed gene modules emulating known biological processes (Fredlund et al., Breast Cancer Research 2012;14(4):R113).
  • A text run summary.
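
For readers who want to relate the -log10(P) bars in the summary PDF to the P-values in the tab-delimited output, the transformation can be sketched in Python as below. The function name neg_log10 and the example P-values are hypothetical and are not taken from GOBO output.

```python
import math


def neg_log10(p_value):
    """Return -log10(P) for a P-value in the interval (0, 1]."""
    if not 0 < p_value <= 1:
        raise ValueError(f"P-value must be in (0, 1], got {p_value!r}")
    return -math.log10(p_value)


# Hypothetical logrank P-values for three sub stratifications.
p_values = {"All tumors": 0.001, "ER-positive": 0.05, "ER-negative": 0.6}
for name, p in p_values.items():
    print(f"{name}: P = {p}, -log10(P) = {neg_log10(p):.2f}")
```

Smaller P-values give taller bars: P = 0.05 corresponds to a bar height of about 1.3, and P = 0.001 to about 3.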