Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a PowerPoint table to create multi-panel plots… and it kind of makes me want to scream.   I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers and R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that when I find a fabulous plot, I like to try and recreate it.  Well, let’s do that again 🙂

I saw this tweet by Trevor Branch and found it intriguing:

It shows two plots of the same data.  The Excel plot:

excel

And the multi plot:

multi

You’re clearly supposed to think the latter is better, and I do; however, perhaps disappointingly, the top graph would be easy to plot in Excel, whereas I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both in R.  All the code is now on GitHub!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:


# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
		 Mental=c(11,11,18,24,23,24,26,23),
		 Musculoskeletal=c(17,18,13,16,12,18,20,26),
		 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df

Now let's plot the graph:


# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 
		lwd=4)


}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 
		labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:

excel_plot

Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.


# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen
	screen(scr)

	# reduce margins
	par(mar=c(3,2,1,1))

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 
		bty="n")

	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)

	}	

	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 
		las=2)	

	# add title
	title(colnames(df)[i])

	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 
			labels=rownames(df))
	}

	# next screen
	scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)

And here is the result:

multi_plot

 


And there we have it.
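A quick practical note: to get either figure out of R and into a paper or a slide at decent resolution, wrap the plotting code in a graphics device call rather than screenshotting the plot window.  A minimal sketch (the filename and dimensions are arbitrary, and matplot is just a compact stand-in for the plotting loops above):


# open a PNG device instead of drawing to the screen
png("excel_style_plot.png", width=2000, height=1400, res=300)

# compact stand-in for the plotting code above
# (assumes df, cols and pch from earlier)
par(mar=c(3,3,0,0))
matplot(1:nrow(df), df, type="o", pch=pch, col=cols,
		lwd=4, lty=1, xlab=NA, ylab=NA, bty="n")

# close the device and write the file
dev.off()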

So which do you prefer?

The unbearable madness of microbiome

This is my attempt to collate the literature on how easy it is to introduce bias into microbiome studies.  I hope to crowd-source papers and add them below under each category.  PLEASE GET INVOLVED.  If this works well we can turn it into a comprehensive review and publish 🙂 Add new papers in the comments or Tweet me 🙂

Special mention to these blogs:

  1. Microbiomedigest’s page on Sample Storage
  2. Microbe.net’s Best practices for sample processing and storage prior to microbiome DNA analysis – freeze? buffer? process?
  3. The-Scientist.com’s Spoiler Alert

****************************************

UPDATE 7th August 2016

****************************************

For posterity, the original blog post is below, but I am now trying to manage this through a ZOTERO GROUP LIBRARY.  Please continue to contribute – join the group, get involved!  I think you can join the group here.  By the way, I am trying to avoid just listing software papers; I would prefer to focus on papers that specifically demonstrate sources of bias.

I won’t be updating the text below.


General

  • “We propose that best practice should include the use of a proofreading polymerase and a highly processive polymerase, that sequencing primers that overlap with amplification primers should not be used, that template concentration should be optimized, and that the number of PCR cycles should be minimized.” (doi:10.1038/nbt.3601)

Sample collection and sample storage

  • “Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples” (24161897)
  • “No significant differences occurred in the culture-based analysis between the fresh, snap or -80°C frozen samples.” (25748176)
  • “In seven of nine cases, the Firmicutes to Bacteroidetes 16S rRNA gene ratio was significantly higher in fecal samples that had been frozen compared to identical samples that had not… The results demonstrate that storage conditions of fecal samples may adversely affect the determined Firmicutes to Bacteroidetes ratio, which is a frequently used biomarker in gut microbiology.” (22325006)
  • “Our results indicate that environmental factors and biases in molecular techniques likely confer greater amounts of variation to microbial communities than do differences in short-term storage conditions, including storage for up to 2 weeks at room temperature” (10.1111/j.1574-6968.2010.01965.x)
  • “The trichloroacetic acid preserved sample showed significant loss of protein band integrity on the SDS-PAGE gel. The RNAlater-preserved sample showed the highest number of protein identifications (103% relative to the control; 520 ± 31 identifications in RNAlater versus 504 ± 4 in the control), equivalent to the frozen control. Relative abundances of individual proteins in the RNAlater treatment were quite similar to that of the frozen control (average ratio of 1.01 ± 0.27 for the 50 most abundant proteins), while the SDS-extraction buffer, ethanol, and B-PER all showed significant decreases in both number of identifications and relative abundances of individual proteins.” (10.3389/fmicb.2011.00215)
  • “Optimized Cryopreservation of Mixed Microbial Communities for Conserved Functionality and Diversity” (10.1371/journal.pone.0099517)
  • “A previously known bias in FTA® cards that results in lower recovery of pure cultures of Gram-positive bacteria was also detected in mixed community samples. There appears to be a uniform bias across all five preservation methods against microorganisms with high G + C DNA. Overall, the liquid-based preservatives (DNAgard™, RNAlater®, and DESS) outperformed the card-based methods.” (22974342)
  • “Microbial composition of frozen and ethanol samples were most similar to fresh samples. FTA card and RNAlater-preserved samples had the least similar microbial composition and abundance compared to fresh samples.” (10.1016/j.mimet.2015.03.021)
  • “Bray-Curtis dissimilarity and (un)weighted UniFrac showed a significant higher distance between fecal swabs and -80°C versus the other methods and -80°C samples (p<0.009). The relative abundance of Ruminococcus and Enterobacteriaceae did not differ between the storage methods versus -80°C, but was higher in fecal swabs (p<0.05)” (10.1371/journal.pone.0126685)
  • “We experimentally determined that the bacterial taxa varied with room temperature storage beyond 15 minutes and beyond three days storage in a domestic frost-free freezer. While freeze thawing only had an effect on bacterial taxa abundance beyond four cycles, the use of samples stored in RNAlater should be avoided as overall DNA yields were reduced as well as the detection of bacterial taxa.” (10.1371/journal.pone.0134802)
  • “A key assumption in many studies is the stability of samples stored long term at −80 °C prior to extraction. After 2 years, we see relatively few changes: increased abundances of lactobacilli and bacilli and a reduction in the overall OTU count. Where samples cannot be frozen, we find that storing samples at room temperature does lead to significant changes in the microbial community after 2 days.” (10.1186/s40168-016-0186-x)

DNA extraction

  • “Caution should be paid when the intention is to pool and analyse samples or data from studies which have used different DNA extraction methods.” (27456340)
  • “Samples clustered according to the type of extracted DNA due to considerable differences between iDNA and eDNA bacterial profiles, while storage temperature and cryoprotectants additives had little effect on sample clustering” (24125910)
  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences when mechanical disruption (bead-beating) procedures for DNA extraction were employed together with optimised “universal” PCR primers” (26120470)
  • The Qiagen DNA stool kit is biased (it misses bifidobacteria)  (26120470)
  • “Bead-beating has a major impact on the determined composition of the human stool microbiota.  Different bead-beating instruments from the same producer gave a 3-fold difference in the Bacteroidetes to Firmicutes ratio” (10.1016/j.mimet.2016.08.005)
  • “We observed that using different DNA extraction kits can produce dramatically different results but bias is introduced regardless of the choice of kit.” (10.1186/s12866-015-0351-6)

Sequencing strategy

  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences … with optimised “universal” PCR primers. These primers incorporate degenerate bases at positions where mismatches to bifidobacteria and other bacterial taxa occur” (26120470)
  • Anything other than 2x250bp sequencing of V4 region (approx 250bp in length) inflates number of OTUs (23793624)
  • “This study demonstrates the potential for differential bias in bacterial community profiles resulting from the choice of sequencing platform alone.” (10.1128/AEM.02206-14)
  • “The effects of DNA extraction and PCR amplification for our protocols were much larger than those due to sequencing and classification” (10.1186/s12866-015-0351-6)
  • “Nested PCR introduced bias in estimated diversity and community structure. The bias was more significant for communities with relatively higher diversity and when more cycles were applied in the first round of PCR” (10.1371/journal.pone.0132253)
  • “pyrosequencing errors can lead to artificial inflation of diversity estimates” (19725865)
  • “Our findings suggest that when alternative sequencing approaches are used for microbial molecular profiling they can perform with good reproducibility, but care should be taken when comparing small differences between distinct methods” (25421243)

Data analysis strategy

 

Bioinformatics / database issues

 

Contamination in kits

  • “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses” (25387460)
  • “Due to contamination of DNA extraction reagents, false-positive results can occur when applying broad-range real-time PCR based on bacterial 16S rDNA” (15722157)
  • “Relatively high initial densities of planktonic bacteria (10(2) to 10(3) bacteria per ml) were seen within [operating ultrapure water treatment systems intended for laboratory use]” (8517737)
  • “Sensitive, real-time PCR detects low-levels of contamination by Legionella pneumophila in commercial reagents” (16632318)
  • “Taq polymerase contains bacterial DNA of unknown origin” (2087233)

 

Commercial stuff

 

 

 

 

 

Bibliography

Why you probably shouldn’t be happy at Stern’s recommendations for REF

If you are a British academic then you will be aware that Lord Stern published his recommendations for REF (the Research Excellence Framework) this week.  REF is a thoroughly awful but necessary process.  Currently your academic career is distilled down to 4 publications and assessed every 4-5 years.  Papers are classified via some unknown system into 2*, 3* or 4* outputs and your value as an academic is recorded accordingly.   Given that higher REF scores result in more money for Universities, your individual REF score has a very real impact on your value to your employer.  This has pros and cons, as I will set out below.

Here are some of the recommendations and my thoughts:

  • Recommendation 1: All research active staff should be returned in the REF
  • Recommendation 2: Outputs should be submitted at Unit of Assessment level with a set average number per FTE but with flexibility for some faculty members to submit more and others less than the average.
  • What currently happens?  At the moment your employer decides whether your outputs make you “REF-able”.  In other words, if you don’t have 4 good outputs/publications, you won’t be submitted and you are REF-invisible
  • Stern recommendation: Stern recommends that all research-active staff be submitted and that the average number of outputs is 2.  However, there is a twist – the number of submissions per person can be between 0 and 6.  Therefore you may be submitted with zero outputs, which is perhaps even worse than being REF-invisible.  Given the formula for the number of expected outputs is 2*N (where N is the number of research-active staff), if a University has fewer than 2*N good outputs, there must surely be a pressure to transfer those with few outputs onto a teaching contract rather than a research contract.  And given the range of 0 to 6, I can see established Profs taking up all 6, with early career researchers being dumped or submitted with zero outputs.  So I’m not impressed by this one.

 

  • Recommendation 3: Outputs should not be portable.
  • What currently happens?  At the moment, an output stays with the individual.  So if I publish a Nature paper during a REF cycle and then move to another University, then my new employer gets the benefit, rather than my old employer.  This has resulted in REF-based recruitment, whereby individuals are recruited by Universities (often with high salaries and incentives) specifically because they have good REF outputs.
  • Stern recommendation: that outputs are not portable.  Specifically, publications would remain with the employer at the time they are accepted for publication.   It’s worth reading what the Stern report says here: “There is a problem in the current REF system associated with the demonstrable increase in the number of individuals being recruited from other institutions shortly before the census date. This has costs for the UK HEI system in terms of recruitment and retention”.   Read and re-read this sentence in context – high impact publications directly influence how much money a University gets from the government; yet here Stern argues that this shouldn’t be used for “recruitment and retention” of staff who produce those publications.  In other words current REF rules are pitched not as some sort of incentive to reward good performance, but as some kind of unnecessary cost that should be banished from the system.   Yes – read it again – potential staff rewards for good performance (“retention”) are quite clearly stated as a “cost” and as a “problem” to HEIs.
  • What the old REF rules did, in a very real way, is give power to the individual.  Publish some high impact papers and not only will other HEIs offer you a job, but your existing employer might try and keep you, offering incentives such as pay rises and promotions.  What Stern is recommending is that power is taken from the individual and handed to the institution.  Once you publish, that’s it, they own the output.  No need to reward the individual anymore.
  • This also has the perverse outcome that an institution’s REF score shows how good they were not how good they are.  Take an extreme toy example – University A might have 100 amazing researchers between 2010 and 2014 and achieve an incredible REF score in 2015; yet they all may have left to go to University B.  How good is University A at research?  Well, not very good because all of their research-active staff left – yet they still have a really good REF score.

 

I don’t really have any major objections to the other recommendations; I think Stern has done a pretty good job on those.  However, I’m not at all happy with 1-3 above.   There are actually very few incentives for pay rises amongst UK academics, and REF was one of those incentives.  Stern wants to remove it.  You can see how healthy your University’s accounts are here (from here);  you will see that the vast majority (about 110 out of 120) of UK universities generated an annual surplus last year, and the whole sector generated a surplus of £1.8Bn.   Yet somehow, incentives to promote, recruit and retain staff who are performing well are a “cost” and a “problem”.  I also don’t think that the recommendations help ECRs as they could remain invisible to the entire process.

In conclusion, I don’t think the recommendations of Stern – or to give him his full title, Professor Lord Stern, Baron of Brentford, FRS, FBA – do anything positive for the individual researcher, they don’t provide much help for ECRs, and they hand power back to Universities.

 

 

Not providing feedback on rejected research proposals is a betrayal of PIs, especially early career researchers

Those of you who follow UK research priorities can’t have missed the creation of the Global Challenges Research Fund (GCRF).  Over the last few months the research councils in the UK have been running the first GCRF competition, a two-stage process where preliminary applications are sifted and only certain chosen proposals are allowed to go through to the second stage.   We put one in, and like many others didn’t make the cut.  I don’t mind this – rejection is part of the process; however I do worry about this phrase in the rejection email:

Please note specific feedback on individual outlines will not be provided.

Before I go on, I want to launch an impassioned defense of the research councils themselves.  Overall I think they do a very good job of a very complex task.  They must receive many hundreds of applications, perhaps thousands, every year and they ensure each is reviewed, discussed at committee and a decision taken.  Feedback is provided and clearly UK science punches above its weight in global terms, so they must be doing something right.  They are funding the right things.

I’m also aware that they have had their budgets cut to the bone over the last decade and by all accounts (anecdotal, so I can’t provide links) the Swindon office has been cut to the bare minimum needed to have a functional system.  In other words, in the face of cuts they have protected research spending.  Good work 🙂

I kind of knew GCRF was in trouble when I heard there had been 1400 preliminary applications.  A £40M pot with expected grants of £600k means around 66 will be funded.  That’s quite a sift!

The argument will go that, with that sheer number of applications there is no way the research councils can provide feedback.  Besides, it was a preliminary application anyway, so it matters less.

I couldn’t disagree more, on both counts.

First of all, let’s deal with the “preliminary” application thing.  Here is what had to happen to get our preliminary application together:

  • Initial exchange of ideas via e-mail, meetings held, coffee drunk
  • Discussions with overseas collaborators in ODA country via skype
  • 4-page concept note submitted to Roslin Science Management Group (SMG)
  • Concept note discussed at SMG’s weekly meeting, feedback provided
  • Costings form submitted and acted on by Roslin finance
  • Quote for sequencing obtained from Edinburgh Genomics
  • Costings provided by two external partners, including partner in ODA country
  • Drafts circulated, commented on, acted on.
  • BBSRC form filled in (almost 2000 words)
  • Je-S form filled in, CVs gathered, formatted and attached, form submitted

In actual fact this is quite a lot of work.  I wouldn’t want to guess at the cost.

Do we deserve some feedback?  Yes.   Of course we do.

When my collaborators ask me why this was rejected, what do I tell them?  “I don’t know”?  Really?

Secondly, let’s deal with the “there were too many applications for us to provide feedback” issue.  I have no idea how these applications were judged internally.  I am unsure of the process.  However, someone somewhere read it; they judged it; they scored it; forms were filled in; bullet points written; e-mails sent; meetings had; a ranked list of applications was created; somewhere, somehow, information about the quality of each proposal was created – why can we not have access to that information?  Paste it into an e-mail and click send.  I know it takes a bit of work, but we put in a lot of work too, as did 1400 other applications.  We deserve feedback.

At the moment we just don’t know – was the science poor?  Not ODA enough?  Not applied enough?  Too research-y?  Too blue sky?  Wrong partners?  We are floundering here and we need help.

Feedback on failed proposals is essential.  It is essential for us to improve, especially for young and mid-career researchers, the ones who haven’t got secure positions, who are being judged on their ability to bring in money.  Failure is never welcome, but feedback always is.  It helps us understand the decision making processes of committees so we can do better next time.  We are always being told “request feedback”, so when it doesn’t come it feels like a betrayal.  How can we do better if we don’t know how and why we failed?

So come on research councils; yes you do a great job; yes you fund great science, in the face of some savage budget cuts.  But please, think again on your decision to not provide feedback.  We need it.

I can’t recreate a graph from Ioannidis et al – can you?

Very quick one this!  Really interesting paper from Ioannidis et al about citation indices.

I wanted to recreate figure 1, which is:

journal.pbio.1002501.g001

Closest I could get (code here) is this:

plos_weird

Biggest difference is in NS, where they find all negative correlations, but most of mine are positive.

Source data are Table S1 Data.

Am I doing something wrong?  Or is the paper wrong?

 

UPDATE 9th July 2016

Using Spearman gets us closer but it’s still not quite correct (updated code too)

results_spearman
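For anyone trying to reproduce this themselves, the only difference between my two attempts is the method argument passed to R’s cor() function.  A minimal sketch (the column names here are invented – the real ones are in Table S1 Data):


# minimal sketch: Pearson vs Spearman on two made-up metrics
d <- data.frame(metric_a=rnorm(100),
		metric_b=rnorm(100))

# Pearson is the default
cor(d$metric_a, d$metric_b, method="pearson")

# Spearman is rank-based, as in the update above
cor(d$metric_a, d$metric_b, method="spearman")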

Which reference manager do you use?

I sent out this tweet yesterday and it produced a bit of a response, so I thought it would be good to get an idea of how people reference when writing papers and grants:

Here is how I do it in Word and Mendeley.

1) Create a new group in Mendeley Desktop

blog1

2) Find a paper I want to cite in pubmed

blog2

3) Click on the Mendeley Chrome plug-in and save it to my new group

blog3

4) Click “insert a citation in Word”:

blog4

5) Search and add the citation in the Mendeley pop-up:

blog5

6) Change the style to something I want….

blog6

7) Here I choose “Genome Biology”:

blog7

8) Add my bibliography by clicking “Insert Bibliography” in Word:

blog8

 

9) Rinse and repeat and I generally add publications iteratively as I write 🙂

 

In an ideal world this would spawn many other blog posts where people show how they use alternative reference managers 🙂

Plot your own EU referendum poll results

Due to the unspeakable horror of the EU referendum, I have to find something to make me feel better.  This poll of polls usually does so, though it is way too close for comfort.

Anyway, I took their data and plotted it for myself.  Data and script are on GitHub, and all you need is R.
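If you would rather roll your own than use my script, here is a minimal sketch.  I am assuming a CSV with date, remain and leave columns – the real column names are in the data file on GitHub:


# minimal sketch, assuming a CSV with columns date, remain and leave
# (the real data and script are in the GitHub repo)
polls <- read.csv("poll_of_polls.csv", stringsAsFactors=FALSE)
polls$date <- as.Date(polls$date)

plot(polls$date, polls$remain, type="l", lwd=3, col="blue4",
		ylim=c(30,60), xlab=NA, ylab="%", bty="n")
lines(polls$date, polls$leave, lwd=3, col="coral3")

legend("topleft", legend=c("Remain","Leave"), lwd=3,
		col=c("blue4","coral3"), bty="n")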

Enjoy! #voteRemain

voteRemain

Open analysis of ZiBRA project MinION data

The ZiBRA project aims to travel around Brazil, collecting Zika samples as they go and sequencing them in “real time” using Oxford Nanopore’s MinION.  They released the first data yesterday, and I have put together some quick scripts to analyse that data and produce a consensus sequence.

First we get the data:


# get Zika reference
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_012532.1&rettype=fasta&retmode=txt" > zika_genome.fa

# get the data
wget -q http://s3.climb.ac.uk/nanopore/Zika_MP1751_PHE_Long_R9_2D.tgz

# surprise, it's not zipped
tar xvf Zika_MP1751_PHE_Long_R9_2D.tgz

Extract 2D FASTQ/A, calculate read lengths, extract metadata:


# extract 2D FASTQ
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.2D.fastq 
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.2D.fastq 

# convert to FASTA
porefq2fa zika.pass.2D.fastq > zika.pass.2D.fasta
porefq2fa zika.fail.2D.fastq > zika.fail.2D.fasta

# read lengths
cat zika.pass.2D.fastq | paste - - - - | awk -F'\t' '{print length($2)}' > zika.pass.2D.readlengths.txt

# extract metadata
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.meta.txt
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.meta.txt

Align to reference genome using BWA


# bwa index the genome
bwa index zika_genome.fa

# samtools index the genome
samtools faidx zika_genome.fa

# run bwa and pipe straight to samtools to create BAM
bwa mem -M -x ont2d zika_genome.fa zika.pass.2D.fastq | \
        samtools view -T zika_genome.fa -bS - | \
        samtools sort -T zika.pass_vs_zika_genome.bwa -o zika.pass_vs_zika_genome.bwa.bam -

samtools index zika.pass_vs_zika_genome.bwa.bam

Align to reference genome using LAST


# create a LAST db of the reference genome
lastdb zika_genome zika_genome.fa

# align high quality reads to the reference genome with LAST
lastal -q 1 -a 1 -b 1 zika_genome zika.pass.2D.fasta > zika.pass_vs_zika_genome.maf

# convert the MAF to BAM with complete CIGAR (matches and mismatches)
python nanopore-scripts/maf-convert.py sam zika.pass_vs_zika_genome.maf | \
    samtools view -T zika_genome.fa -bS - | \
    samtools sort -T zika.pass_vs_zika_genome.last -o zika.pass_vs_zika_genome.last.bam -

samtools index zika.pass_vs_zika_genome.last.bam

Count errors using Aaron Quinlan’s excellent script:


# count errors
python nanopore-scripts/count-errors.py zika.pass_vs_zika_genome.last.bam > zika.pass_vs_zika_genome.last.txt

Produce a consensus sequence:


# produce consensus
samtools mpileup -vf zika_genome.fa zika.pass_vs_zika_genome.bwa.bam | bcftools call -m -O z - > allsites.vcf.gz
bcftools index allsites.vcf.gz
bcftools consensus -f zika_genome.fa allsites.vcf.gz > zika.minion.consensus.fasta

And finally, produce some QC images:


# QC images
./scripts/qc.R
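
I haven’t reproduced qc.R here, but as a rough idea of the kind of figure it might produce (this is a sketch of one possible QC plot, not the actual script), a read-length histogram from the file generated earlier would look something like:


# sketch of one possible QC plot (not the actual qc.R script):
# histogram of 2D read lengths from the file produced above
lens <- scan("zika.pass.2D.readlengths.txt")

png("zika_pass_2D_readlengths.png", width=1200, height=800)
hist(lens, breaks=50, col="grey",
		xlab="2D read length (bp)", main="ZiBRA MinION pass reads")
dev.off()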

This is all in a GitHub repo; I’m pretty sure the results produced here aren’t going to be used for anything serious, but this serves as a simple, exemplar, self-contained and reproducible analysis.

Converting from GenBank format to FASTA using Microsoft Excel

Should be a popular one this 🙂

First of all we need an example.  I’m using R23456, which we can download from NCBI.  On that page, look towards the top-right, click “Send To”, choose “File”, leave format as “GenBank (full)” and click “Create File”.  This should download to your computer.

In Excel, click File -> Open, navigate to the folder you downloaded the GenBank sequence to, make sure “All files (*.*)” is chosen in the File Open dialogue box and choose the recently downloaded file (it is almost certainly called sequence.gb)

This should immediately open up the Excel “Text Import Wizard”.  Just hit Finish.  You should see something like:

excel

Now, scroll down to the sequence data, which in my version is in cells A55 to A61.  Select all of those cells.

Select the Data menu and then click “Text to columns”

test2columns

This brings up another wizard, but again you can just click finish.  The sequence data is now in B55 through G61.

Now click in cell I55 and type “=B55&C55&D55&E55&F55&G55”.  Hit enter.  Now click on the little green square and drag down to I61 so we have sequence data for all of the cells.

Now we’re getting close!

Click cell A1 and then repeat the “Text to Columns” trick.

Now go back down to cell I54 and type the formula =">"&B1 (that’s a greater-than sign wrapped in double quotes), then hit enter.  We now have something that looks a lot like FASTA!

Final step – select all cells in the range I54:I61, hit Ctrl-C, then hit the “+” at the bottom to add a new sheet, and in the new sheet hover over cell A1, right-click your mouse and choose Paste Special.

In the “Paste Special” dialogue click the radio button next to “Values” and hit OK.

Now click File -> Save As, navigate to a suitable folder, make sure “Save as type” is set to “Text (Tab delimited) (*.txt)”, give it a filename and hit Save.  Click “OK” and “Yes” to Excel’s next two questions.

Go look at the file, it’s FASTA!

fasta

Awesome!  Who needs bioinformaticians eh?

(and if you use this, please never, ever speak to me.  Thanks 🙂)
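(For balance: the same conversion is a couple of lines in R using the ape package.  This is a sketch of just one of many non-Excel routes, and it is not part of the recipe above.)


# sketch: fetch the same accession and write FASTA using the ape package
library(ape)

seqs <- read.GenBank("R23456")
write.dna(seqs, file="R23456.fasta", format="fasta")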

You don’t need to work 80 hours a week in academia…. but you do need to succeed

I’ve been thinking a lot lately about academic careers, chiefly because I happen to be involved in some way with three fellowship applications at the moment.  For those of you unfamiliar with the academic system at work here, the process is: PhD -> Post-Doc -> Fellowship -> PI (group leader) -> Professorship (Chair).  So getting a fellowship is that crucial jump from post-doc to PI and represents a person’s first chance to drive their own research programme.  Sounds grand doesn’t it?  “Drive your own research programme”.  Wow.  Who wouldn’t want to do that?

Well, be careful what you wish for.  Seriously.  I love my job; I love science and I love computers and I get paid to do both.  It’s amazing and possibly (for me) one of the best jobs in the world.  However, it comes with huge pressures; job insecurity; unparalleled and relentless criticism; failures, both of your ideas and your experiments; and occasionally the necessity of working with and for people who act like awful human beings.  It also requires a lot of hard work, and even then, that isn’t enough.  This THE article states very clearly and eloquently that very few people actually work an 80 hour week in academia, and you do not need to in order to succeed.   I would tentatively agree, though I have pointed out some of the things you need to do to succeed in the UK system, and one of them is working hard.

It’s true, you don’t need to work 80 hours a week in academia…. but you do need to succeed.

What does success look like?

Unfortunately, science lends itself very well to metrics: number of papers published; amount of money won in external grant funding; number of PhD students trained; feedback scores from students you teach; citation indices; journal indices.  And probably many more.  All of these count, I’m sorry but they do.  We may wish for a better world, but we don’t yet live in one, so believe me – these numbers count.

To succeed as a PI, even a new one such as a fellow, you will need to win at least one external grant.  Grant success rates are published: here they are for BBSRC, NERC and MRC.  I skim-read these statistics and the success rate for standard “response mode” grants seems to be somewhere between 10 and 25%.  However, bear in mind that this includes established professors and people with far better track records and reputations than new fellows have.   Conservatively, I would halve those success rates for new PIs, taking your chances of success to between 5 and 12%.  What that means is you’re going to have to write somewhere between 8 and 20 grants just to win one.   I couldn’t find statistics for the UK, but the average age at which a researcher in the US gets their first R01 grant is 42.  Just take a moment and think about that.

It’s not all doom and gloom – there are “new investigator” schemes specifically designed for new PIs.  The statistics look better – success between 20-30% for NERC, and similar for BBSRC.   However, note the NERC grants are very small – £1.3M over 20 awards is an average of £65k per award, and that probably covers you for about 8 months at FEC costing rates.  The BBSRC new investigator awards have no upper limit, so there is a tiny speck of light at the end of the tunnel.  The statistics say that you will still need to write between 3 and 5 of these just to win one though.

What do grant applications look like?

I am most familiar with BBSRC, so what’s below may be specific to them, but I imagine other councils are similar.  Grant applications consist of the following documents:

  1. JeS form
  2. Case for support
  3. Justification of resources
  4. Pathways to impact
  5. Data management plan
  6. CV
  7. Diagrammatic workplan (optional)
  8. Letters of support (optional)

The JeS form is an online form containing several text sections: Objectives, Summary, Technical Summary, Academic Beneficiaries and Impact Statement.  I haven’t done a word count because they are PDFs, but that’s probably around 1000 words.

The case for support is the major description of the research project and stretches to at least 8 pages, depending on how much money you’re asking for.   Word counts for my last 4 are 4450, 4171, 3666, and 3830.

The JoR, DMP and PtI are relatively short, 1-2 pages, and mine are typically 300-500 words each, so let’s say 1000 words in total.

Therefore, each grant is going to need 6000 words (properly referenced, properly argued) over 5 documents.  They need to be coherent, they need to tell a story and they need to convince reviewers that this is something worth funding.

Given the success rates I mentioned above, there is every possibility that you need to write between 5 and 10 of these in any given period to be deemed a success.   In other words, for success, you’re going to need to write often, write quickly and write well.   Don’t come into academia if you don’t like writing.

(by the way, there is such a thing as a LoLa which stands for “longer, larger”.  These are, as you may guess, longer and larger grants – the last one I was involved in, the case for support was 24 pages and 15,400 words – about half a PhD thesis)

Failure is brutal

I’ll take you through a few of my failures so you can get a taste….

In 2013 the BBSRC put out a call for research projects in metagenomics.  We had been working on this since 2011, looking to discover novel enzymes of industrial relevance from metagenomics data.  What we found when we assembled such data was that we had lots of fragmented/incomplete genes.  I had a bunch of ideas about how to solve this problem, including  targeted assembly of specific genes, something we were not alone in thinking about.    Reviews were generally good (Exceptional, Excellent and Very Good scores), but we had one comment about the danger of mis-assemblies.  Now, I had an entire section in the proposal dealing with this, basically stating that we would use reads mapped back to the assembly to find and remove mis-assembled contigs.  This is a tried, tested, and established method for finding bad regions of assemblies, and we have used it very successfully in other circumstances.   Besides which, mis-assembled contigs in metagenomic assemblies are very rare, probably around 1-3%.  I explained all this and didn’t think anything of it.  Mis-assemblies really aren’t a problem, and we have a method for dealing with it anyway.

The grant was rejected.  I asked for feedback from the committee (which can take 3 months by the way, and is often just a few sentences).   The feedback was that we had a problem with mis-assemblies and we didn’t have a method for dealing with it.  Apparently, the method we proposed (a tried and tested method!) represented a “circular argument” i.e. using the same reads for both assembly and validation was wrong.   Anyone working in this area can see that argument doesn’t make sense.  So our grant was rejected, because of a problem that isn’t important, which we had a method for dealing with, by someone who demonstrated a complete lack of understanding of that problem.   Frustrating?  I had to take a long walk, believe me.

In 2015 I wrote a grant to the BBSRC FLIP scheme for a small amount of money (~£150k) to get various bioinformatics software tools and pipelines (e.g. BWA+GATK) working on Archer, the UK’s national supercomputer.   It’s a Cray supercomputer, massively parallel but with relatively low RAM per core, and jobs absolutely limited to 48 hours.   The grant was rejected, with feedback containing such gems as “the PI is not a software developer” and “Roslin is not an appropriate place to do software development”.   That was over a year ago and I am still angry.

The last LoLa I was involved in was the highest scoring LoLa from the committee that assessed it.  They fully expected it to be funded.  It wasn’t, killed at a higher level committee.  So even getting through review and committee approval, you can still lose out.  One of the reviewer’s comments was that better assembled and annotated animal genomes will only represent a “1% improvement” over SNP chips.   I can’t even….

Our Meta4 paper was initially rejected for being “just a bunch of Perl scripts”; our viRome paper similarly rejected for being “a collection of simple R functions”; our paper on identifying poor regions of the pig genome assembly got “it seems a bunch of bioinformaticians ran some algorithms with no understanding of biology”; whilst our poRe paper was initially rejected without review because it “contains no data” (at the time I knew the poretools paper was under review at the same journal and also contained no data).

What point am I trying to make?   That failure is common, criticism is brutal and often you will fail because of comments that are either incorrect, unfair or both.  And there is often no appeal.

Lack of success may mean lack of a job

It’s more and more common now for academic institutions to apply funding criteria when assessing the success of their PIs: there have been redundancies at the University of Birmingham, as expectations on grant income were set; staff at Warwick have been given financial targets; Dr Alison Hayman was sacked for not winning enough grants; and there was the awful, tragic death of Stefan Grimm after Imperial set targets of £200k per annum.

To put that in context, the average size of a BBSRC grant is £240k, so Imperial are asking their PIs to win approximately one RCUK grant per year.  Do the maths using the success rates I mention above: at a 10-25% success rate, that means writing somewhere between four and ten grant applications every single year.

Is the 80 hour week a myth?

Yes it is; but the 60 hour week is not.  You may have a family, mouths to feed, bills to pay, a mortgage.  To do all of that you need a job, and to keep that job you need to win grants.  Maybe you haven’t won one in a while.  Tell me, under those circumstances, how many hours are you working?

Working in academia (for me) is wonderful.  I absolutely love it and wouldn’t change it for anything else.  However, it’s also highly competitive and at times brutal.  There are nights I don’t sleep.  A few years ago, my dentist told me I had to stop grinding my teeth.

It’s a wonderful, wonderful job – but in the current system, believe me, it’s not for everyone.  I recommend you choose your career wisely.  You don’t need to work 80 hours a week, but you do need to succeed.

 

 

 


© 2016 Opiniomics
