Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Recreating a famous visualisation

Many of you will be aware of the Project Tycho visualisation(s) about vaccination that were featured in the Wall Street Journal – if you’re not, you should be.  They’re beautiful, and almost everything a good visualisation should be:

wsj_projecttychoThe plots shown on the WSJ website are actually interactive and use Javascript, which is a bit beyond the scope of this blog post, but I wanted to show how you can create something similar.

The Data

I can’t give you the data due to T&Cs, but you can download it yourself.  Go to Project Tycho, register, log in, choose level 2 data, click “search and retrieve data”, choose state rather than city, add all of the states to the selection box, unclick graph and Submit query.  There is then an option to download in Excel.  The Excel is actually CSV and we can import this into R quite easily.  This is what (a subset of) my data look like in R:

     YEAR WEEK ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT
1096 1930    1       7      0              0       4      196        178       18          64
1097 1930    2      24      0              0       0        2        442       69          62
1098 1930    3      28      0              0       2        9        490       26          44
1099 1930    4      21      0              0       1        3        628       40          23
1100 1930    5      47      0              0       5        7        864      101          25
1101 1930    6      63      0              0       6        6        943      101          24
1102 1930    7     108      0              0       5        5        954       65          20
1103 1930    8     167      0              0       2        9       1151      170          13
1104 1930    9     191      0              0       7       15       1433      150          23
1105 1930   10     242      0              0       5        8       1514      256          39

We need to do a few things – limit data to beyond 1930, substitute the “-” for zeros, convert everything to numeric, and sum over weeks:

d[d=="-"] <- 0
for (i in 2:61) {
   d[,i] <- as.numeric(d[,i])
}
d <- d[d$YEAR>=1930,]
yd <- aggregate(d[,3:61], by=list(year=d[,1]), sum)
yd <- yd[order(yd$year),]

My data now look like this:

   year ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE
1  1930    4389      0              0    2107      996      43585    11821        1978      264
2  1931    8934      0              0    2135      849      27807     4787       12869     2428
3  1932     270      0              0      86       99      12618     2376        5701       39
4  1933    1735      0              0    1261     5438      26551      297        5430      205
5  1934   15849      0              0    1022     7222      25650    12946        4659     3063
6  1935    7214      0              0     586     1518      28799    12786       24149     1040
7  1936     572      0              0    2378      107      49092      604        3987     1490
8  1937     620      0              0    3793      322       5107     1027       10181     1796
9  1938   13511      0              0     604     6690      19452     8403        1388      492
10 1939    4381      0              0     479     1724      67180     5182       15817      159

Creating the visualisation

It’s important to note that great visualisations don’t just happen; they take work; they evolve over time; there is trial-and-error until you get what you want.  Don’t be scared of this, embrace it!

I’m going to use the heatmap.2() function from gplots in R.  Let’s look at a basic, raw heatmap (with some things turned off that I always turn off in heatmap.2()):

library(gplots)
heatmap.2(as.matrix(yd[,2:60]), trace="none", key=FALSE)

basic_rawOK… I don’t actually want to cluster years, nor do I want to cluster states, so I should turn off the dendrograms.  I also have the data the wrong way round, so I need to transpose the matrix:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
          dendrogram="none", trace="none", key=FALSE)

trans_rawOK, we start to see a patter where there are more yellow squares to the left of the diagram; yellow in this context means high, and left means early in the period we’re looking at – so we begin to see patterns.  Still a terrible heatmap.  There is lots of wasted space; there are no year labels; and the state names are being cut-off.  Let’s deal with those:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12))

I get an error about the figure margins being too large, but this is quite common and can often be ignored

better_rawWe’re getting somewhere now.  It’s time to think about colours.

Colours

Let’s get something awkward out of the way.  The guys who created this visualisation cheated a little bit.  Let’s look at the colour scale:

wsj_colorsThis is not a natural colour scale.  White-through-to-blue is fine, but blue does not naturally flow into yellow and then red.  So this is a false color scheme which has been designed for maximum impact – low numbers of measles cases get nice colours such as white, blue and green; high numbers get danger colours such as yellow and red.  Is this acceptable?  I’ll let you decide.

In addition, note the WSJ figures have been scaled per 100,000 population.  I do not have the historical population figures for American states going back to 1930, so we are plotting the raw data (raw number of cases) not scaled for population.  So we can’t completely recreate the WSJ figure, but we can get close.

Colours and breaks

In heatmaps, we have the concept of colours and breaks.  Breaks are provided as a vector of numbers, and essentially, we use each colour to represent values that fall between each subsequent pair of breaks.  In other words, the colours fit “between” pairs of breaks.  Perhaps best explained by example.  Let’s say our breaks vector is 0, 100, 200, and our colours are “green” and “red”.  Therefore, any value between 0 and 100 will be green in the heatmap, and any value between 100 and 200 will be red.   As colours fit in the middle between breaks, there should be one less colour than there are breaks.  I often find it useful to set my own breaks and colours in heatmaps – the function can decide for you, but how do you know what assumptions it is making?

Measles

If I want to recreate the colours of the WSJ plot, I need low values between white and blue, and medium-to-high values between yellow and red.  My minimum and maximum values are 0 and 126221:

min(yd[,2:60])
max(yd[,2:60])

I’m going to use colorRampPallette() to create ten colours between white and blue; then 30 colours between yellow and red.  I’m also going to game the breaks, just so I can get something similar to the WSJ look:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(30))

I now need to set my breaks.  I want white to be zero, so my first two breaks should be 0 and 0 (i.e. only values between 0 and 0 will be white; in other words, only 0 will be white).  I’m then going to scale up to 700 in steps of 100; and the rest I am going to split equally between 800 and the maximum value (which I’ll set at 127000).

bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,length=32))

NOTE: this is a bit dodgy.  Really, my bins should be the same size, but it’s the only way I could recreate the WSJ look with this data.

Now the heatmap:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks)

better_coloursOK!  Now we’re getting somewhere!  So, the WSJ heatmap has gaps between each row and between each column, so let’s add those in:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white")

datasepWe want a line at 1961, which after a few bits of trial-and-error, turns out to be at x=32:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

datasep_lineNow let’s tidy up those year labels:

row.labels <- rep("", 72)
row.labels[c(1,11,21,31,41,51,61,71)] <- c("1930","1940","1950","1960","1970",
                                           "1980","1990","2000")
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2))

tidy_labelsWe really should add a title, and we’ll need to go back and make room for it using the lhei parameter:

par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001nVaccine introduced 1961')

titleAnd there we have it!  I think this is about as close to the WSJ viualisation as I can get!

The only thing I would say is that I am disappointed that I had to game the break points to get here.  It doesn’t sit well with me.  However, it was the only way I could figure out how to get so much blue into the left hand side of the graph, whilst still keeping the right side white.

The simple fact is, that keeping the break points a similar size (100), the graph looks better:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
          colorRampPalette(c("yellow", "red"))(1261))
bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,by=100))
par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL, 
        dendrogram="none", trace="none", key=FALSE,
        labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
        col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
        add.expr=lines(c(32,32),c(0,1000),lwd=2),
        main='Measles cases in US states 1930-2001nVaccine introduced 1961
             n(data from Project Tycho)')

final

So, there we are 🙂  A final (larger version) can be found here.

5 Comments

  1. Have you looked at the distribution of the number of cases? It might be possible to transform the data (log?) to make the color scheme better fit

  2. Cool. You could take a screen shot of WSJ, use a color picker and reverse engineer their palette function on a spreadsheet, old school. I figured out a good function for lightening halos like that. They probably customized the ramp themselves anyway.

  3. I did lots of bioinformatics in my postdoctoral work. That was almost ten years ago, but this really took me back and I love how you explain it all. Rock on!

  4. Awesome post – explained well – can’t wait to use that on my data. 🙂

Leave a Reply

Your email address will not be published.

*

*

code

© 2017 Opiniomics

Theme by Anders NorenUp ↑