Title: Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data

(NYU-CS-TR845)

Authors: Vera Cherepinsky, Jiawu Feng, Marc Rejali, and Bud Mishra


Abstract:

The current standard correlation coefficient used in the analysis of 
microarray data, including gene expression arrays, was introduced
in [1].  Its formulation is rather arbitrary. We give a
mathematically rigorous derivation of the correlation coefficient of
two gene expression vectors based on James-Stein Shrinkage estimators.
We use the background assumptions described in [1], also
taking into account the fact that the data can be treated as
having been transformed to follow normal distributions. While [1] uses zero
as an estimator for the expression vector mean μ, we start with
the assumption that for each gene, μ is itself a zero-mean normal
random variable (with a priori distribution N(0, τ^2)),
and use Bayesian analysis to update that belief, to obtain the a
posteriori distribution of μ in terms of the data. The estimator
for μ, obtained after shrinkage towards zero, differs from the
mean of the data vectors and ultimately leads to a statistically
robust estimator for correlation coefficients.  
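
As an illustration of the Bayesian update described above, the textbook
normal-normal case can be written out. This is only a sketch under
assumptions not stated in the abstract: a gene's N observations X_1, ..., X_N
are taken to be independent N(μ, σ^2) with σ^2 known, and the symbols N,
σ^2, and X-bar are introduced here for illustration. The estimator of γ
actually derived in the report, where these quantities are estimated from
the data, may differ in form.

```latex
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad
\mu \mid X_1,\dots,X_N \sim
  N\!\left(\gamma\,\bar{X},\;
  \frac{\tau^2\,\sigma^2/N}{\tau^2+\sigma^2/N}\right), \qquad
\gamma = \frac{\tau^2}{\tau^2+\sigma^2/N}
```

Under these assumptions the posterior mean is the sample mean shrunk
towards zero by a factor γ between 0 and 1: γ tends to 1 as τ^2 grows
(no shrinkage, the sample mean is used) and to 0 as τ^2 tends to 0 (the
estimate collapses to zero, the estimator used in [1]).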

To evaluate the effectiveness of shrinkage, we conducted in
silico experiments and also compared similarity metrics on a
biological example using the data set from [1].  For the
latter, we classified genes involved in the regulation of yeast 
cell-cycle functions by computing clusters based on various definitions of
correlation coefficients, including the one using shrinkage, and
contrasting them against clusters based on the activators known in the
literature.  In addition, we conducted an extensive computational 
analysis of the data from [1], empirically testing the performance of 
different values of the shrinkage factor γ and comparing them to 
the values of γ corresponding to the three metrics addressed here, 
namely, γ=0 for the Eisen metric, γ=1 for the Pearson 
correlation coefficient, and γ computed from the data for the 
Shrinkage metric.
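
The one-parameter family of metrics described above can be sketched in a
few lines of Python. This is a minimal illustration, not the report's
implementation: the function name offset_similarity is hypothetical, and
the normalization (root-mean-square deviation about the offset) is assumed
to follow the form used in [1]. Each expression vector is offset by γ
times its own mean, so γ=0 gives the zero-offset (Eisen) metric and γ=1
gives the Pearson correlation coefficient. The data-driven choice of γ
for the Shrinkage metric is not reproduced here.

```python
import math


def offset_similarity(x, y, gamma):
    """Similarity of two expression vectors, each offset by gamma * its mean.

    gamma = 0 -> zero offset (Eisen-style metric)
    gamma = 1 -> offset by the sample mean (Pearson correlation)
    Assumed form; a vector with zero deviation about its offset
    would cause a division by zero and is not handled here.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")

    # Offset each vector by gamma times its own sample mean.
    off_x = gamma * sum(x) / n
    off_y = gamma * sum(y) / n
    dx = [a - off_x for a in x]
    dy = [b - off_y for b in y]

    # Root-mean-square deviation about the offset, used for normalization.
    phi_x = math.sqrt(sum(a * a for a in dx) / n)
    phi_y = math.sqrt(sum(b * b for b in dy) / n)

    return sum(a * b for a, b in zip(dx, dy)) / (n * phi_x * phi_y)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [3.0, 2.0, 1.0]
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, offset_similarity(x, y, gamma))
```

For the two vectors in the usage example, which are perfectly
anti-correlated about their means, the γ=1 value is -1 while the γ=0 value
is positive, which shows how strongly the choice of offset can change the
similarity assigned to a pair of genes.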

The estimated "false positives" and "false negatives" from this
study indicate the relative merits of clustering algorithms based on
different statistical correlation coefficients as well as the
sensitivity of the clustering algorithm to small perturbations in the
correlation coefficients.  These results indicate that using the
shrinkage metric improves the accuracy of the analysis.

All derivation steps are described in detail; all mathematical assertions 
used in the derivation are proven in the appendix.


[1] Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998), 
PNAS USA 95, 14863-14868.