Model-based approaches to analyzing and exploring gene expression data

Prof. Alejandro Murua
Department of Mathematics and Statistics,
University of Montreal, Canada

In this work we present model-based approaches to analyzing microarray gene expression data. We start by describing a flexible Markov random field approach to modeling cDNA microarray images. This model allows for the simultaneous estimation of hybridization and background intensities. An algorithm similar to iterated conditional modes (ICM) is used to estimate the parameters of the model.
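To give a feel for what an ICM-style update looks like, here is a minimal sketch on a toy two-class image (signal versus background). It is not the model described above: the Gaussian noise with a shared variance, the 4-neighbour Potts-type smoothing prior, the function name `icm_segment` and the parameter `beta` are all assumptions made for illustration.

```python
import numpy as np

def icm_segment(image, beta=1.0, n_sweeps=5):
    """Toy two-class ICM: label 1 = signal (spot), label 0 = background.

    Illustrative assumptions: Gaussian noise with one shared variance and a
    4-neighbour smoothing prior with strength `beta`.
    """
    # Initialise labels by thresholding at the overall mean intensity.
    labels = (image > image.mean()).astype(int)
    rows, cols = image.shape
    for _ in range(n_sweeps):
        # Re-estimate the class means and shared variance given the labels.
        means = np.array([
            image[labels == k].mean() if np.any(labels == k) else image.mean()
            for k in (0, 1)
        ])
        var = max(np.mean((image - means[labels]) ** 2), 1e-8)
        changed = False
        for i in range(rows):
            for j in range(cols):
                neighbours = [
                    labels[a, b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols
                ]
                energies = []
                for k in (0, 1):
                    # Negative log-likelihood of the pixel under class k.
                    data_term = (image[i, j] - means[k]) ** 2 / (2.0 * var)
                    # Penalty for disagreeing with neighbouring labels.
                    prior_term = beta * sum(1 for n in neighbours if n != k)
                    energies.append(data_term + prior_term)
                # Conditional mode: pick the label with the lowest energy.
                best = int(np.argmin(energies))
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break
    return labels, means
```

Each sweep alternates between updating the intensity parameters and setting every pixel label to its conditional mode given its neighbours, which is the general pattern an ICM-like scheme follows.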

Once the intensities are estimated, exploratory analysis of the group structure in the microarray data may help explain the function of unknown genes by relating them to known genes in the same group, or help diagnose a patient according to the pattern observed in the corresponding gene expression data. We describe the application of model-based clustering with Gaussian mixtures for the first task, and of Potts model clustering for both tasks. The latter model was first proposed by Blatt, Wiseman and Domany (1996) as a general clustering method. We build on their work and show that Potts model clustering is linked to kernel K-means and the MNCut methods, and hence shares their good performance. We also show that slightly modified versions of Potts model clustering and kernel K-means (a penalized Potts model clustering and a weighted kernel K-means, respectively) solve the same problem, and we introduce an algorithm, a penalized version of the Wolff algorithm, to uncover the cluster structure. We also note the link between kernel-based methods and non-parametric kernel density estimation, and use it to propose several estimates of the kernel bandwidths that improve the performance of the algorithms.
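As a point of reference for the kernel K-means method mentioned above, the following is a minimal sketch of the plain (unweighted, unpenalized) algorithm with a Gaussian kernel. It is not the penalized Potts or Wolff-type algorithm of the talk. The function names, the `init` argument and the fixed `bandwidth` are illustrative assumptions; the bandwidth estimates proposed in the work are not reproduced here.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    """Gaussian kernel matrix for the rows of X (bandwidth is a free choice here)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_kmeans(K, n_clusters, n_iter=50, init=None, seed=0):
    """Plain kernel K-means on a precomputed kernel matrix K."""
    n = K.shape[0]
    if init is None:
        # Random initial labels (an arbitrary illustrative choice).
        labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    else:
        labels = np.asarray(init).copy()
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster cannot attract points
            # Squared feature-space distance to the cluster mean:
            # K_ii - (2/|c|) sum_j K_ij + (1/|c|^2) sum_{j,l} K_jl
            dist[:, c] = (
                np.diag(K)
                - 2.0 * K[:, members].sum(axis=1) / size
                + K[np.ix_(members, members)].sum() / size ** 2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

For two well-separated groups of points, a call such as `kernel_kmeans(gaussian_kernel(X, bandwidth=1.0), 2)` would be expected to recover the groups, although like ordinary K-means the result depends on the initial labels and on the bandwidth.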

The advantages of using probabilistic models for exploring group structure in this kind of data are numerous. Among the most important is access to the distribution of the cluster labels, and hence the possibility of drawing statistical inferences about the cluster structure of the data.
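To illustrate what access to the distribution of the cluster labels means in the simplest case, here is a small sketch that computes posterior label probabilities under a univariate Gaussian mixture with known parameters. The univariate setting, the function name `label_posteriors` and the parameter values in the usage note are assumptions chosen for illustration, not the models fitted in this work.

```python
import numpy as np

def label_posteriors(x, weights, means, sds):
    """Posterior probability of each mixture component for each observation."""
    x = np.asarray(x, dtype=float)[:, None]
    weights, means, sds = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    # Log density of each observation under each Gaussian component.
    log_dens = (
        -0.5 * ((x - means) / sds) ** 2 - np.log(sds) - 0.5 * np.log(2.0 * np.pi)
    )
    log_joint = np.log(weights) + log_dens
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

With two equally weighted components centred at 0 and 4 with unit standard deviation, an observation at 2 is assigned to either cluster with probability one half, whereas an observation at 0 belongs almost surely to the first. A hard clustering alone would not convey this difference in uncertainty.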

The work on cDNA microarray images was done in collaboration with R. Gottardo, J. Besag and M. Stephens. The work on Gaussian model-based clustering was done in collaboration with K. Y. Yeung, C. Fraley, A. E. Raftery and W. L. Ruzzo. The work on Potts model clustering was done in collaboration with L. Stanberry and W. Stuetzle.