
Your strongly correlated data is probably nonsense

Use of the Pearson correlation coefficient is common in genomics and bioinformatics, and it is fine as far as it goes (I have used it extensively myself), but it has some major drawbacks – the main one being that Pearson can produce large coefficients in the presence of a few very large measurements (outliers).

This is best shown via example in R:

# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)

cor(g1, g2)
# [1] -0.1486646

So we get a small, negative correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution centred on zero, as expected.
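
To convince yourself of that, here is a quick sketch of the null distribution (the seed and variable names are my own choices):

```r
# sketch: the null distribution of Pearson's r for pairs of 50 random values
set.seed(1)
rs <- replicate(1000, cor(rnorm(50), rnorm(50)))
mean(rs)  # close to zero
sd(rs)    # roughly 1/sqrt(n - 1), i.e. about 0.14 for n = 50
```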

Let's add in a single, large value:

# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)
cor(g1, g2)
# [1] 0.6040776

Holy smokes, all of a sudden my random datasets are positively correlated with r>=0.6!

It's also significant.

> cor.test(g1,g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor 
0.6040776 

So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.
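
The effect scales up, too. Here is a minimal sketch (the sizes and the outlier value are invented for illustration) of how a single extreme sample contaminates an entire gene-by-gene correlation matrix:

```r
set.seed(1)
# 20 "genes" measured across 50 samples, all pure noise
m <- matrix(rnorm(20 * 50), nrow = 20)
# one extra sample in which every gene happens to be high
m <- cbind(m, rnorm(20, mean = 10))
cc <- cor(t(m))          # 20 x 20 Pearson matrix between genes
mean(cc[upper.tri(cc)])  # average pairwise r is now strongly positive
```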

How can you solve this? By using Spearman, of course:

> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.0961086 


  1. Andrzej Zielezinski

    27th April 2016 at 10:46 am

    I would suggest Kendall rank correlation, rather than Spearman’s rho. Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables.

    The main advantages of using Kendall’s tau are as follows:
    1. The distribution of Kendall’s tau has better statistical properties.
    2. The interpretation of Kendall’s tau in terms of the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs is very direct. It tells you the difference between the probability that the two variables are in the same order in the observed data versus the probability that the two variables are in different orders.
    3. Dave Howell, in Statistical Methods for Psychology, concludes that Kendall’s tau is generally preferred over Spearman’s rho, “because it is a better estimate of the corresponding population parameter, and its standard error is known.”
    4. Roger Newson has argued for the superiority of Kendall’s τ (tau) over Spearman’s rho as a rank-based measure of correlation in a paper: Newson R. Parameters behind “nonparametric” statistics: Kendall’s tau, Somers’ D and median differences. Stata Journal 2002; 2(1):45-64. (http://www.stata-journal.com/article.html?article=st0007). He refers (on p. 47) to Kendall & Gibbons (1990) as arguing that “…confidence intervals for Spearman’s rho are less reliable and less interpretable than confidence intervals for Kendall’s τ-parameters, but the sample Spearman’s rho is much more easily calculated without a computer” (which is no longer of much importance, of course).

    Calculation of Kendall’s tau in R: cor.test(x, y, method="kendall")
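
    A quick check using the post’s toy setup (regenerated here with a fixed seed, so the exact numbers differ from the post) shows that tau is similarly robust to the single outlier:

```r
set.seed(1)
g1 <- c(rnorm(50), 10)
g2 <- c(rnorm(50), 11)
cor(g1, g2, method = "pearson")  # inflated by the one extreme pair
cor(g1, g2, method = "kendall")  # stays near zero
```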

  2. Fair enough.
    Two comments:

    1- not everyone works with 50 genes:
    a <- rnorm(30000)
    b <- rnorm(30000)
    cor(a, b, method = "pearson")
    [1] 0.0009655594
    cor(c(a, 10), c(b, 11))
    [1] 0.004605644

    Relatively massive impact, but still low PCC value.

    2- for some applications, time matters:
    system.time(cor(a, b, method = "pearson"))
    user system elapsed
    0 0 0
    system.time(cor(a, b, method = "spearman"))
    user system elapsed
    0.08 0.01 0.09
    system.time(cor(a, b, method = "kendall"))
    user system elapsed
    111.48 0.09 116.69
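
    One possible mitigation (my suggestion, not from the thread): when the data contain no ties, Spearman’s rho is exactly Pearson applied to ranks, so you can rank once and reuse the fast Pearson code path:

```r
set.seed(1)
a <- rnorm(30000)
b <- rnorm(30000)
r_spearman <- cor(a, b, method = "spearman")
r_ranked   <- cor(rank(a), rank(b), method = "pearson")
all.equal(r_spearman, r_ranked)  # TRUE (rnorm output has no ties)
```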

  3. Looking at the data is also a good way to prevent this kind of thing.
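
    Plotting aside, a leave-one-out sweep (a sketch using the post’s toy setup, with my own seed) also makes a single influential point jump out numerically:

```r
set.seed(1)
g1 <- c(rnorm(50), 10)
g2 <- c(rnorm(50), 11)
# correlation with each observation removed in turn
loo <- sapply(seq_along(g1), function(i) cor(g1[-i], g2[-i]))
cor(g1, g2)                        # inflated full-data r
which.max(abs(loo - cor(g1, g2)))  # 51: removing the outlier changes r the most
```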

  4. Very interesting post and comments.
    Good highlight on limitations.
    Not everybody is working on 30,000 samples 😉

  5. It isn’t 30,000 samples that is the problem—it is 30,000 choose 2 hypotheses!
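
    To put a number on that (my arithmetic, in R):

```r
n_genes <- 30000
n_pairs <- choose(n_genes, 2)  # 449,985,000 pairwise hypotheses
n_pairs * 0.05                 # expected false positives at alpha = 0.05 under the null
0.05 / n_pairs                 # a Bonferroni-corrected per-test threshold
```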

  6. Sorry, I don’t get your point concerning 30,000.

    It seems clear that there is no meaningful impact on Pearson’s correlation from 2 nicely correlated values among 30,000. Among 50 values, the impact is clear.

    Let’s say we compare 2 genes measured on 50 samples. On the one hand, I may decide I am not interested in the co-occurrence of 2 high values in the same sample while the values in the other samples are low; Spearman is then OK. On the other hand, if that co-occurrence might be of interest, Pearson sounds OK.

    IMHO, it’s a question of choice between robustness and sensitivity, depending on the question being asked of the data.

  7. The big question is whether the two genes were chosen before looking at the data, in which case correlation on 50 samples can be highly informative, or after looking at the data (picking the highest correlated pair of genes), in which case 50 samples is nowhere near enough to say anything meaningful.

  8. I get your point now. A confirmatory experiment is indeed necessary.
    If the initial point raised by this post concerns the use of Pearson’s correlation in hierarchical clustering, I think there is little risk of getting wrong results. Concerning sample correlation, one outlier among thousands of genes does not change the correlation much, as shown. Concerning gene correlation, groups consist of at least ten samples, so the correlation between genes is also safe.
    If the point concerns picking the best correlations from a matrix of gene correlations in order to induce networks, for example, this post is a valuable warning.
    In any case, it’s a good discussion.


© 2018 Opiniomics
