SPECLUST

Information

Latest news

July 2018 Updated SPECLUST to version 1.2. SPECLUST was moved to a new host which triggered some code changes. The tree plot in the 'Cluster identification' module is restored and visible again.
November 28 2012 Fixed a regression introduced during change of server, September 2012. Output for consensus peak list may again contain several peaks from one list.
February 13 2011 If a consensus peak list (in the consensus output) contains several peaks from one peak list, all peaks are included in output, semicolon separated.
May 16 2007 Bug fixed in Peaks in common. Links to result files are now valid also in Internet Explorer.
March 27 2007 Cluster identification application has been added. Output format of consensus peak list has been modified and contains more information than before.
April 19 2006 Bug fixed. Can now read file names containing spaces.
February 25 2006 Peaks in common application has been modified. Output format of 'Multiple' has been changed and is now more condensed. Also, a consensus peak list is available.

How to cite

Please cite this paper if you use SPECLUST:

R. Alm, P. Johansson, K. Hjernø, C. Emanuelsson, M. Ringnér, and J. Häkkinen
Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra
Journal of Proteome Research 5, 785-792 (2006)

Data format

The uploaded file must be a zip file containing mass spectrum files. Each spectrum file must have the suffix '.peaks', contain only peak masses, and use white space to delimit the masses. All files without the suffix '.peaks' are ignored. The zip file can contain a directory structure; all .peaks files will be extracted for use in the clustering.

The peak masses are given in decimal format (e.g. 1.00794) or in scientific format (e.g. 1.6605e-27). Do not use raw spectrum files. Extract peaklists from raw spectra with software such as M/Z (MoverZ), Piums, or similar. Here is an example file that yields the following dendrogram, pairwise, multiple, and consensus results using default parameter settings.
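As an illustration of the format described above, the following Python sketch reads every '.peaks' file from a zip archive into a list of masses. It is not SPECLUST's own parser; the function name read_peak_lists, the UTF-8 decoding, and the file names in the usage part are assumptions made for the example.

```python
import io
import zipfile


def read_peak_lists(zip_source):
    """Return {file name: list of masses} for every '.peaks' file in the zip."""
    peak_lists = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Files without the '.peaks' suffix (and directory entries) are skipped.
            if not name.endswith(".peaks"):
                continue
            # Assumption: the files are plain text that decodes as UTF-8.
            text = archive.read(name).decode("utf-8")
            # split() handles any white space; float() accepts both decimal
            # (1.00794) and scientific (1.6605e-27) notation.
            peak_lists[name] = [float(token) for token in text.split()]
    return peak_lists


# Build a small archive in memory (hypothetical file names) and read it back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("run1/A.peaks", "845.127 861.112\n932.192\t2470.57\n")
    archive.writestr("run1/notes.txt", "this file is ignored")
buffer.seek(0)
peak_lists = read_peak_lists(buffer)
print(peak_lists)
```

The archive contains a subdirectory and a non-peaks file on purpose, to show that only the '.peaks' entry is picked up.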

Peak match score

The peak match score reflects the probability that two peaks, with measured masses m and m' and measurement uncertainty σ, originate from the same peptide. The score approaches zero for measurements infinitely far apart and is unity for identical measurements. The score is defined as the probability of getting a mass difference equal to or larger than |m-m'|, given that the difference is only due to measurement errors and that the measurement errors are Gaussian. This gives a peak match score: 1-erf(|m-m'|/(2σ)).
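The formula can be written as a short Python function. This is a minimal sketch: the function name is made up for the example, and σ = 1.0 in the usage lines is an assumed value (it appears to be consistent with the first two scores shown in the 'Pairwise' example further down, but it is not stated on this page).

```python
import math


def peak_match_score(m, m_prime, sigma):
    """Peak match score 1 - erf(|m - m'| / (2 * sigma))."""
    return 1.0 - math.erf(abs(m - m_prime) / (2.0 * sigma))


# Assumed sigma of 1.0; the masses are taken from the 'Pairwise' example.
print(round(peak_match_score(845.127, 845.088, 1.0), 3))  # 0.978
print(peak_match_score(861.112, 861.112, 1.0))            # identical masses give 1.0
```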

Clustering

Each peak list is initially assigned to its own cluster. Distances are calculated between each pair of peak lists. The closest pair is found and merged into a new cluster. Distances between the new cluster and each of the old clusters are calculated. This procedure is repeated until a single cluster remains. The output of this hierarchical clustering is a dendrogram.
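The procedure can be sketched in Python as follows. This is an illustration under assumptions, not the SPECLUST implementation: the function name agglomerate, the merge-history return format, and the example matrix are invented here, and the rule for combining distances is passed in as a function (min is used as the default only to keep the example short).

```python
import itertools


def agglomerate(matrix, combine=min):
    """Return the merge history [(members_a, members_b, distance), ...]."""
    clusters = {i: (i,) for i in range(len(matrix))}  # cluster id -> peak list indices
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in itertools.combinations(clusters, 2)}
    merges = []
    next_id = len(matrix)
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        merges.append((clusters[a], clusters[b], dist[frozenset((a, b))]))
        members = clusters.pop(a) + clusters.pop(b)
        # Distance from the merged cluster to each of the remaining clusters.
        for c in clusters:
            dist[frozenset((next_id, c))] = combine(dist[frozenset((a, c))],
                                                    dist[frozenset((b, c))])
        clusters[next_id] = members
        next_id += 1
    return merges


# Hypothetical distances between three peak lists.
matrix = [[0.0, 0.1, 0.8],
          [0.1, 0.0, 0.6],
          [0.8, 0.6, 0.0]]
merges = agglomerate(matrix)
print(merges)  # [((0,), (1,), 0.1), ((2,), (0, 1), 0.6)]
```

The merge history carries the same information as the dendrogram: which clusters were joined and at what distance.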

Similarity score

The similarity score, S, between two peak lists is calculated as S = Σ s_ij, where s_ij is the peak match score between peak i in the first list and peak j in the second list, and the sum runs over the matched pairs. Each peak can only be matched to one other peak, and peak order (by mass) cannot be permuted (i.e. if peaks m and M from the first list are matched to peaks m' and M' from the second list, respectively, then either m < M and m' < M', or m > M and m' > M'). The matches are chosen to maximize the similarity score using the Needleman-Wunsch algorithm.
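A dynamic-programming sketch of this order-preserving matching is given below. It assumes that unmatched peaks carry no penalty (no gap cost is mentioned on this page) and that σ = 1.0; the function names are chosen for the example, and SPECLUST's actual implementation may differ in detail.

```python
import math


def peak_match_score(m, m_prime, sigma):
    """Peak match score 1 - erf(|m - m'| / (2 * sigma))."""
    return 1.0 - math.erf(abs(m - m_prime) / (2.0 * sigma))


def similarity(list_a, list_b, sigma):
    """Maximum sum of peak match scores over order-preserving one-to-one matchings."""
    a, b = sorted(list_a), sorted(list_b)
    # best[i][j] is the best score using the first i peaks of a and the first j of b.
    best = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best[i][j] = max(
                best[i - 1][j],      # leave peak i of the first list unmatched
                best[i][j - 1],      # leave peak j of the second list unmatched
                best[i - 1][j - 1] + peak_match_score(a[i - 1], b[j - 1], sigma),
            )
    return best[-1][-1]


# Peak lists from the 'Pairwise' example; sigma = 1.0 is an assumption.
list_a = [845.127, 861.112, 932.192, 2470.57]
list_b = [845.088, 861.099, 2470.34]
print(similarity(list_a, list_b, 1.0))
```

Sorting both lists by mass and only moving forward in the table is what enforces the rule that matched peaks cannot cross.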

Metric

Three metrics to calculate a distance based on the similarity score are available. They differ in how they treat the case when the number of peaks, N, in the first list is different from the number of peaks, N', in the second list. The liberal distance is calculated as d = 1 - S/min(N,N'). Intuitively, this distance measure corresponds to the fraction of peaks in the smaller peak list having no match in the larger peak list. The conservative distance is calculated as d = 1 - S/max(N,N'). Intuitively, this distance measure corresponds to the fraction of peaks in the larger peak list having no match in the smaller peak list. The correlation-based distance is calculated as d = 1 - S/(N·N')^(1/2) and is an intermediate between the other two metrics.
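The three distances can be compared side by side with a small Python sketch. The function name, the metric labels used as strings, and the input value S = 2.84 are assumptions chosen for illustration.

```python
import math


def distance(similarity, n_first, n_second, metric="liberal"):
    """Convert a similarity score S into a distance using one of the three metrics."""
    if metric == "liberal":
        denominator = min(n_first, n_second)
    elif metric == "conservative":
        denominator = max(n_first, n_second)
    elif metric == "correlation":
        denominator = math.sqrt(n_first * n_second)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return 1.0 - similarity / denominator


# Hypothetical case: S = 2.84 between lists with 4 and 3 peaks.
for name in ("liberal", "correlation", "conservative"):
    print(name, round(distance(2.84, 4, 3, name), 4))
```

For lists of unequal length the liberal distance is the smallest and the conservative distance the largest, with the correlation-based distance in between; for lists of equal length all three coincide.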

Linkage

Three linkage methods are available to calculate the distance from the merged cluster to another cluster. If the distances from the two clusters (that were merged) to the other cluster were d and d' respectively, in single linkage the distance D between the merged cluster and the other cluster is calculated as D=min(d,d'). In complete linkage the distance is calculated as D=max(d,d'). In average linkage the distance is calculated as D=(n·d+n'·d')/(n+n'), where n and n' are the number of lists in the two clusters.
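The three update rules can be written as one small Python function. This is a sketch for illustration; the function name, the string labels for the linkage methods, and the numbers in the usage lines are assumptions.

```python
def merged_distance(d, d_prime, n, n_prime, linkage="average"):
    """Distance D from a merged cluster to another cluster.

    d, d_prime: distances from the two merged clusters to the other cluster.
    n, n_prime: number of peak lists in the two merged clusters.
    """
    if linkage == "single":
        return min(d, d_prime)
    if linkage == "complete":
        return max(d, d_prime)
    if linkage == "average":
        return (n * d + n_prime * d_prime) / (n + n_prime)
    raise ValueError(f"unknown linkage: {linkage}")


# Hypothetical case: a cluster of 2 lists at distance 0.2 and a single list at 0.5.
for name in ("single", "average", "complete"):
    print(name, merged_distance(0.2, 0.5, 2, 1, name))
```

Weighting by n and n' in average linkage means that every peak list in the merged cluster contributes equally, regardless of the order in which the lists were merged.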

Cluster identification

Clusters are identified in the dendrogram produced by the clustering application. A cut-off value for the distance in the dendrogram is used. Mass spectra joined in nodes at distances below the cut-off are considered a cluster. Each cluster can be submitted to the peaks in common application for further analysis. This application becomes available as part of the output from the clustering application.
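Cutting a dendrogram at a distance cut-off can be sketched as follows. The representation of the dendrogram as a merge history of (members_a, members_b, distance) tuples, the function name, and the choice to report unmerged peak lists as single-member groups are all assumptions made for this example; the sketch also assumes that merge distances never decrease from one merge to the next.

```python
def cut_dendrogram(merges, n_leaves, cutoff):
    """Group leaves that are joined in nodes at distances below the cut-off."""
    parent = list(range(n_leaves))

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members_a, members_b, dist in merges:
        if dist < cutoff:
            # Joining one member from each side joins the two groups.
            parent[find(members_a[0])] = find(members_b[0])

    groups = {}
    for leaf in range(n_leaves):
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(groups.values())


# Hypothetical dendrogram over three peak lists (0, 1, 2).
merges = [((0,), (1,), 0.1), ((2,), (0, 1), 0.6)]
print(cut_dendrogram(merges, 3, cutoff=0.5))  # [[0, 1], [2]]
print(cut_dendrogram(merges, 3, cutoff=0.7))  # [[0, 1, 2]]
```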

Peaks in common

Pairwise

For each pair of peak lists the matched peaks are printed. In the example below, a peak list A is aligned to a peak list B. Peak list A contains peaks with masses 845.127, 861.112, 932.192, and 2470.57. Peak list B contains peaks with masses 845.088, 861.099, and 2470.34. The three pairs of matched peaks that have a peak match score above the 'pairwise score cutoff' are printed together with their peak match score and average mass. Note that peaks that are not matched in the alignment of the two peak lists, or that have a peak match score below the 'pairwise score cutoff', are not printed.

A.peaks 845.127 861.112 2470.57
B.peaks 845.088 861.099 2470.34
Score: 0.977999 0.992666 0.875151
Average: 845.108 861.105 2470.45

Multiple

For each peak a total score is calculated as Σs/(N-1), where the sum runs over scores that are larger than the 'pairwise score cutoff' and correspond to matched peaks, in other words, scores that are printed in the 'Pairwise' output. N is the number of peak lists, which means the largest possible total score is 1.0, and that only occurs when the peak has a perfect match in each of the other N-1 peak lists. For each peak list the peaks with a score larger than the 'Multiple score cutoff' are printed. In the example below peak lists A, B, C, and D have been aligned to each other. Peak list A contains peaks with masses 845.127, 861.112, 932.192, and 2470.57, but only peaks 845.127 and 861.112 are printed, because the two other peaks have too small a total score. Peak 861.112, on the other hand, has been matched to each of the 3 other peak lists and has a total score close to unity (0.954593).

A.peaks 845.127 861.112
0.649049 0.954593
B.peaks 845.073 861.116
0.65343 0.954583
C.peaks 845.088 861.099
0.656381 0.949736
D.peaks 861.337
0.871625
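The total score calculation can be sketched in a few lines of Python. The function name, the cutoff value of 0.7, and the individual pairwise scores in the usage lines are illustrative assumptions; they are not taken from SPECLUST output.

```python
def total_score(matched_scores, n_lists, pairwise_cutoff):
    """Sum of pairwise scores above the cutoff, divided by (N - 1)."""
    kept = [s for s in matched_scores if s > pairwise_cutoff]
    return sum(kept) / (n_lists - 1)


# A peak matched in all 3 other lists (hypothetical pairwise scores).
print(round(total_score([0.997743, 0.992666, 0.873591], 4, 0.7), 6))  # 0.954667

# A peak matched in only 2 of the 3 other lists cannot exceed 2/3.
print(round(total_score([0.969541, 0.977999], 4, 0.7), 6))  # 0.64918
```

Dividing by N-1 rather than by the number of matches is what penalizes peaks that are missing from some of the lists.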

Consensus

A consensus peak list is generated by partitioning the peaks into sets of peaks. Sets containing more peaks than the 'Consensus cutoff' are considered consensus peaks, and for those sets statistics such as average and standard deviation are printed. The partitioning into sets is done as follows. The first peak is assigned to set 1. Each peak that was matched to the first peak in 'Pairwise' is assigned to set 1. Then each peak that was matched to any of the peaks in set 1 is assigned to set 1. This is repeated until every peak having a match to a peak in set 1 is assigned to set 1. The same procedure is applied to the remaining peaks to assign peaks to set 2, set 3, etc. In the example below, two consensus peaks have been generated with average peak mass 845.096 and 861.166, respectively. Note that only sets containing more peaks than the 'Consensus cutoff' (in this case 2) are printed.

average  std        N  min      max      A.peaks  B.peaks  C.peaks  D.peaks
845.096  0.0281649  3  845.073  845.127  845.127  845.073  845.088
861.166  0.114495   4  861.099  861.337  861.112  861.116  861.099  861.337
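The partitioning described above amounts to collecting peaks that are connected through pairwise matches. A minimal Python sketch follows; the function name, the (list name, mass) representation of a peak, and the particular set of matches are assumptions made for the example, with masses borrowed from the 'Multiple' example.

```python
from statistics import mean


def consensus_sets(peaks, matches, consensus_cutoff):
    """Partition peaks into sets linked by matches; keep sets larger than the cutoff."""
    neighbours = {peak: set() for peak in peaks}
    for p, q in matches:
        neighbours[p].add(q)
        neighbours[q].add(p)

    assigned = set()
    kept = []
    for start in peaks:
        if start in assigned:
            continue
        # Grow the set from its first peak until no matched peak is left out.
        current, frontier = [start], [start]
        assigned.add(start)
        while frontier:
            peak = frontier.pop()
            for other in neighbours[peak]:
                if other not in assigned:
                    assigned.add(other)
                    current.append(other)
                    frontier.append(other)
        if len(current) > consensus_cutoff:
            kept.append(current)
    return kept


# Peaks as (list name, mass); the matches below are hypothetical.
peaks = [("A", 845.127), ("A", 861.112), ("A", 932.192), ("B", 845.073),
         ("B", 861.116), ("C", 845.088), ("C", 861.099), ("D", 861.337)]
matches = [(("A", 845.127), ("B", 845.073)), (("B", 845.073), ("C", 845.088)),
           (("A", 861.112), ("B", 861.116)), (("A", 861.112), ("C", 861.099)),
           (("C", 861.099), ("D", 861.337))]
sets = consensus_sets(peaks, matches, consensus_cutoff=2)
for peak_set in sets:
    print(len(peak_set), round(mean(mass for _, mass in peak_set), 3))
```

Note that a peak joins a set as soon as it matches any member, so two peaks in the same set need not have been matched to each other directly.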