Clustering is an important tool in microarray data analysis. This unsupervised learning technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms applied so far produce hard partitions of the data, i.e. each gene is assigned to exactly one cluster. Hard clustering is favourable if clusters are well separated. However, this is generally not the case for microarray time-course data, where gene clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise.
To overcome the limitations of hard clustering, we have implemented soft clustering, which offers several advantages for researchers. First, it generates accessible internal cluster structures, i.e. it indicates how well each cluster represents the genes assigned to it. This can be used for a more targeted search for regulatory elements. Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more robust to noise, so a priori pre-filtering of genes can be avoided. This prevents the exclusion of biologically relevant genes from the data analysis.
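The two advantages above can be illustrated with a small Python sketch. It is not part of Mfuzz: the gene names, the membership values, the 0.7 core threshold and the averaged-product overlap measure are all assumptions chosen for illustration. It shows how membership values expose genes that a hard partition would assign without comment, and how shared membership can describe the relation between clusters.

```python
# Hypothetical membership values for five genes in three clusters.
# Each row sums to 1; none of these numbers come from real data.
memberships = {
    "geneA": [0.92, 0.05, 0.03],
    "geneB": [0.55, 0.40, 0.05],
    "geneC": [0.10, 0.85, 0.05],
    "geneD": [0.34, 0.33, 0.33],
    "geneE": [0.02, 0.08, 0.90],
}


def hard_assignment(memberships):
    """Collapse soft memberships to one cluster index per gene (largest value)."""
    return {
        gene: max(range(len(u)), key=u.__getitem__)
        for gene, u in memberships.items()
    }


def cluster_core(memberships, cluster, min_membership=0.7):
    """Genes well represented by `cluster`; the threshold is an assumed value."""
    return sorted(
        gene for gene, u in memberships.items() if u[cluster] >= min_membership
    )


def cluster_overlap(memberships, k, l):
    """Assumed coupling measure: mean product of memberships in clusters k and l."""
    return sum(u[k] * u[l] for u in memberships.values()) / len(memberships)


hard = hard_assignment(memberships)
core0 = cluster_core(memberships, 0)
overlap01 = cluster_overlap(memberships, 0, 1)
overlap02 = cluster_overlap(memberships, 0, 2)

print("hard assignment:", hard)
print("core of cluster 0:", core0)
print("overlap 0-1:", round(overlap01, 3), "overlap 0-2:", round(overlap02, 3))
```

In this toy table a hard partition places geneA, geneB and geneD in the same cluster, although only geneA is well represented by it. The membership values keep that distinction, and the overlap measure suggests that clusters 0 and 1 are more closely related than clusters 0 and 2.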
Studies which have used Mfuzz can be found here and here.
- Publication I (pdf): The methodology for soft clustering is introduced in this article (which appeared in Journal of Bioinformatics and Computational Biology). Additionally, the differences between soft and hard clustering are discussed. As a case study, we applied soft clustering to gene expression data of the yeast cell cycle. Based on this data set, several favourable features of soft clustering could be demonstrated.
- Publication II (pdf): A short introduction (which appeared in Bioinformation) to the R/Bioconductor packages Mfuzz and Mfuzzgui.
- Presentation I: Brief presentation of the main features of soft clustering of gene expression data.
- Presentation II: General introduction to cluster analysis in microarray data analysis with an emphasis on soft clustering of gene expression data.
- PhD thesis chapter: The chapter gives a general introduction to clustering and motivates the use of noise-robust clustering methods. It includes the analysis presented in the publication above, but also an extended discussion of future directions.
Soft clustering was implemented here using the fuzzy c-means algorithm. A software package termed Mfuzz for soft clustering has been developed based on the open-source statistical language R. The Mfuzzgui package provides a convenient graphical user interface for most functions implemented in Mfuzz.
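For readers unfamiliar with fuzzy c-means, the following is a minimal, self-contained Python sketch of the standard textbook algorithm. It is not the Mfuzz implementation, which is written in R. The function name `fuzzy_c_means`, its arguments and the toy expression profiles are assumptions made for illustration. Here `c` is the number of clusters and `m` (greater than 1) is the fuzzification parameter; the algorithm alternates between updating cluster centres and membership values until the memberships stop changing.

```python
import math
import random


def fuzzy_c_means(data, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Soft-cluster `data` (a list of equal-length numeric vectors) into c clusters.

    Returns (centers, memberships); memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])

    # Start from random memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    exponent = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centre update: mean of all points, weighted by membership^m.
        centers = []
        for k in range(c):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append([
                sum(w * x[d] for w, x in zip(weights, data)) / wsum
                for d in range(dim)
            ])

        # Membership update: proportional to distance^(-2/(m-1)).
        new_u = []
        for x in data:
            dists = [math.dist(x, ctr) for ctr in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a centre: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                new_u.append([h / total for h in hits])
                continue
            inv = [d ** (-exponent) for d in dists]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        # Stop once no membership value changes by more than `tol`.
        change = max(
            abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb)
        )
        u = new_u
        if change < tol:
            break

    return centers, u


# Toy time-course profiles (made-up values): two rising, two falling, one flat.
profiles = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 1.9],
    [2.0, 1.0, 0.0],
    [1.9, 0.9, 0.1],
    [1.0, 1.0, 1.0],
]
centers, u = fuzzy_c_means(profiles, c=2, m=2.0)
for profile, row in zip(profiles, u):
    print(profile, [round(v, 2) for v in row])
```

In this toy example the rising and falling profiles receive high membership in different clusters, while the flat profile, which lies midway between the two patterns, is shared between them instead of being forced into one.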
Note that the most current packages can be obtained from the Bioconductor site.
The latest version also includes functions for the estimation of clustering parameters c and m as proposed in a recent publication.
Mfuzz v.2.3.1 (with Mfuzzgui included)
Using ExpressionSet objects (v.1.9.2)
Using exprSet objects (v.1.8.0, for older versions of R/Bioconductor)
Introduction to Mfuzz package (including instructions for installation): Pdf
Mfuzzgui requiring Mfuzz based on ExpressionSet objects.
Mfuzzgui requiring Mfuzz based on exprSet objects.
Introduction to Mfuzzgui package (including instructions for installation): Pdf
For more information about R and the related Bioconductor project:
Questions and comments regarding the Mfuzz package can be addressed to
Last modified: Fri Mar 30 17:01:39 CEST 2012