Title: Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data (NYU-CS-TR845) Authors: Vera Cherepinsky, Jiawu Feng, Marc Rejali, and Bud Mishra Abstract: The current standard correlation coefficient used in the analysis of microarray data, including gene expression arrays, was introduced in [1]. Its formulation is rather arbitrary. We give a mathematically rigorous derivation of the correlation coefficient of two gene expression vectors based on James-Stein Shrinkage estimators. We use the background assumptions described in [1], also taking into account the fact that the data can be treated as transformed into normal distributions. While [1] uses zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable (with a priori distribution N(0,τ2)), and use Bayesian analysis to update that belief, to obtain a posteriori distribution of μ in terms of the data. The estimator for μ, obtained after shrinkage towards zero, differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example using the data set from [1]. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients, including the one using shrinkage, and contrasting them against clusters based on the activators known in the literature. In addition, we conducted an extensive computational analysis of the data from [1], empirically testing the performance of different values of the shrinkage factor γ and comparing them to the values of γ corresponding to the three metrics adressed here, namely, γ=0 for the Eisen metric, γ=1 for the Pearson correlation coefficient, and γ computed from the data for the Shrinkage metric. The estimated "false-positives" and "false-negatives" from this study indicate the relative merits of clustering algorithms based on different statistical correlation coefficients as well as the sensitivity of the clustering algorithm to small perturbations in the correlation coefficients. These results indicate that using the shrinkage metric improves the accuracy of the analysis. All derivation steps are described in detail; all mathematical assertions used in the derivation are proven in the appendix. [1] Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998), PNAS USA 95, 14863-14868.