Many of you will be aware of the Project Tycho visualisation(s) about vaccination that were featured in the Wall Street Journal – if you’re not, you should be. They’re beautiful, and almost everything a good visualisation should be:

The plots shown on the WSJ website are actually interactive and use Javascript, which is a bit beyond the scope of this blog post, but I wanted to show how you can create something similar.

**The Data**

I can’t give you the data due to T&Cs, but you can download it yourself. Go to Project Tycho, register, log in, choose level 2 data, click “search and retrieve data”, choose state rather than city, add all of the states to the selection box, unclick graph and Submit query. There is then an option to download in Excel. The Excel is actually CSV and we can import this into R quite easily. This is what (a subset of) my data look like in R:

YEAR WEEK ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT
1096 1930 1 7 0 0 4 196 178 18 64
1097 1930 2 24 0 0 0 2 442 69 62
1098 1930 3 28 0 0 2 9 490 26 44
1099 1930 4 21 0 0 1 3 628 40 23
1100 1930 5 47 0 0 5 7 864 101 25
1101 1930 6 63 0 0 6 6 943 101 24
1102 1930 7 108 0 0 5 5 954 65 20
1103 1930 8 167 0 0 2 9 1151 170 13
1104 1930 9 191 0 0 7 15 1433 150 23
1105 1930 10 242 0 0 5 8 1514 256 39

We need to do a few things – limit data to beyond 1930, substitute the “-” for zeros, convert everything to numeric, and sum over weeks:

d[d=="-"] <- 0
for (i in 2:61) {
d[,i] <- as.numeric(d[,i])
}
d <- d[d$YEAR>=1930,]
yd <- aggregate(d[,3:61], by=list(year=d[,1]), sum)
yd <- yd[order(yd$year),]

My data now look like this:

year ALABAMA ALASKA AMERICAN.SAMOA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE
1 1930 4389 0 0 2107 996 43585 11821 1978 264
2 1931 8934 0 0 2135 849 27807 4787 12869 2428
3 1932 270 0 0 86 99 12618 2376 5701 39
4 1933 1735 0 0 1261 5438 26551 297 5430 205
5 1934 15849 0 0 1022 7222 25650 12946 4659 3063
6 1935 7214 0 0 586 1518 28799 12786 24149 1040
7 1936 572 0 0 2378 107 49092 604 3987 1490
8 1937 620 0 0 3793 322 5107 1027 10181 1796
9 1938 13511 0 0 604 6690 19452 8403 1388 492
10 1939 4381 0 0 479 1724 67180 5182 15817 159

**Creating the visualisation**

It’s important to note that great visualisations don’t just happen; they take work; they evolve over time; there is trial-and-error until you get what you want. Don’t be scared of this, embrace it!

I’m going to use the heatmap.2() function from gplots in R. Let’s look at a basic, raw heatmap (with some things turned off that I always turn off in heatmap.2()):

library(gplots)
heatmap.2(as.matrix(yd[,2:60]), trace="none", key=FALSE)

OK… I don’t actually want to cluster years, nor do I want to cluster states, so I should turn off the dendrograms. I also have the data the wrong way round, so I need to transpose the matrix:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE)

OK, we start to see a patter where there are more yellow squares to the left of the diagram; yellow in this context means high, and left means early in the period we’re looking at – so we begin to see patterns. Still a terrible heatmap. There is lots of wasted space; there are no year labels; and the state names are being cut-off. Let’s deal with those:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12))

I get an error about the figure margins being too large, but this is quite common and can often be ignored

We’re getting somewhere now. It’s time to think about colours.

**Colours**

Let’s get something awkward out of the way. The guys who created this visualisation cheated a little bit. Let’s look at the colour scale:

This is **not** a natural colour scale. White-through-to-blue is fine, but blue does not naturally flow into yellow and then red. So this is a **false** color scheme which has been designed for maximum impact – low numbers of measles cases get nice colours such as white, blue and green; high numbers get danger colours such as yellow and red. Is this acceptable? I’ll let you decide.

In addition, note the WSJ figures have been scaled per 100,000 population. I do not have the historical population figures for American states going back to 1930, so we are plotting the raw data (raw number of cases) not scaled for population. So we can’t completely recreate the WSJ figure, but we can get close.

**Colours and breaks**

In heatmaps, we have the concept of colours and breaks. Breaks are provided as a vector of numbers, and essentially, we use each colour to represent values that fall between each subsequent pair of breaks. In other words, the colours fit “between” pairs of breaks. Perhaps best explained by example. Let’s say our breaks vector is 0, 100, 200, and our colours are “green” and “red”. Therefore, any value between 0 and 100 will be green in the heatmap, and any value between 100 and 200 will be red. As colours fit in the middle between breaks, there should be one less colour than there are breaks. I often find it useful to set my own breaks and colours in heatmaps – the function can decide for you, but how do you know what assumptions it is making?

**Measles**

If I want to recreate the colours of the WSJ plot, I need low values between white and blue, and medium-to-high values between yellow and red. My minimum and maximum values are 0 and 126221:

min(yd[,2:60])
max(yd[,2:60])

I’m going to use colorRampPallette() to create ten colours between white and blue; then 30 colours between yellow and red. I’m also going to game the breaks, just so I can get something similar to the WSJ look:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
colorRampPalette(c("yellow", "red"))(30))

I now need to set my breaks. I want white to be zero, so my first two breaks should be 0 and 0 (i.e. only values between 0 and 0 will be white; in other words, only 0 will be white). I’m then going to scale up to 700 in steps of 100; and the rest I am going to split equally between 800 and the maximum value (which I’ll set at 127000).

bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,length=32))

**NOTE: this is a bit dodgy. Really, my bins should be the same size, but it’s the only way I could recreate the WSJ look with this data.**

Now the heatmap:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks)

OK! Now we’re getting somewhere! So, the WSJ heatmap has gaps between each row and between each column, so let’s add those in:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white")

We want a line at 1961, which after a few bits of trial-and-error, turns out to be at x=32:

heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=yd$year, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks, colsep=1:72, rowsep=1:57, sepcolor="white",
add.expr=lines(c(32,32),c(0,1000),lwd=2))

Now let’s tidy up those year labels:

row.labels <- rep("", 72)
row.labels[c(1,11,21,31,41,51,61,71)] <- c("1930","1940","1950","1960","1970",
"1980","1990","2000")
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=row.labels, cexCol=1, lhei=c(0.1,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
add.expr=lines(c(32,32),c(0,1000),lwd=2))

We really should add a title, and we’ll need to go back and make room for it using the lhei parameter:

par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
add.expr=lines(c(32,32),c(0,1000),lwd=2),
main='Measles cases in US states 1930-2001nVaccine introduced 1961')

And there we have it! I think this is about as close to the WSJ viualisation as I can get!

The only thing I would say is that I am disappointed that I had to game the break points to get here. It doesn’t sit well with me. However, it was the only way I could figure out how to get so much blue into the left hand side of the graph, whilst still keeping the right side white.

The simple fact is, that keeping the break points a similar size (100), the graph looks better:

cols <- c(colorRampPalette(c("white", "cornflowerblue"))(10),
colorRampPalette(c("yellow", "red"))(1261))
bks <- c(0,0,100,200,300,400,500,600,700,seq(800,127000,by=100))
par(cex.main=0.8)
heatmap.2(as.matrix(t(yd[,2:60])), Rowv=NULL, Colv=NULL,
dendrogram="none", trace="none", key=FALSE,
labCol=row.labels, cexCol=1, lhei=c(0.15,1), lwid=c(0.1,1), margins=c(5,12),
col=cols, breaks=bks, colsep=1:72, srtCol=0, rowsep=1:57, sepcolor="white",
add.expr=lines(c(32,32),c(0,1000),lwd=2),
main='Measles cases in US states 1930-2001nVaccine introduced 1961
n(data from Project Tycho)')

So, there we are 🙂 A final (larger version) can be found here.