Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: fun stuff (page 1 of 3)

Beautiful boxplots in base R

As many of you will be aware, I like to post some R code, and I especially like to post base R versions of ggplot2 things!

Well these amazing boxplots turned up on github – go and check them out!

So I did my own version in base R – check out the code here and the result below.  Enjoy!

probability

Plotting cool graphs in R

I have to admit to being a bit of a snob when it comes to graphs and charts in scientific papers and presentations.  It’s not like I think I am particularly good at it – I’m OK – it’s just that I know what’s bad.  I’ve seen folk screenshot multiple Excel graphs so they can paste them into a powerpoint table to create multi-panel plots… and it kind of makes me want to scream.   I’m sorry, I really am, but when I see Excel plots in papers I judge the authors, and I don’t mean in a good way.  I can’t help it.  Plotting good graphs is an art, and sticking with the metaphor, Excel is paint-by-numbers and R is a blank canvas, waiting for something beautiful to be created; Excel is limiting, whereas R sets you free.

Readers of this blog will know that I like to take plots that I find which are fabulous and recreate them.  Well let’s do that again 🙂

I saw this Tweet by Trevor Branch on Twitter and found it intriguing:

It shows two plots of the same data.  The Excel plot:

excel

And the multi plot:

multi

You’re clearly supposed to think the latter is better, and I do; however perhaps disappointingly, the top graph would be easy to plot in Excel but I’m guessing most people would find it impossible to create the bottom one (in Excel or otherwise).

Well, I’m going to show you how to create both, in R. All code now in Github!

The Excel Graph

Now, I’ve shown you how to create Excel-like graphs in R before, and we’ll use some of the same tricks again.

First we set up the data:


# set up the data
df <- data.frame(Circulatory=c(32,26,19,16,14,13,11,11),
		 Mental=c(11,11,18,24,23,24,26,23),
		 Musculoskeletal=c(17,18,13,16,12,18,20,26),
		 Cancer=c(10,15,15,14,16,16,14,14))

rownames(df) <- seq(1975,2010,by=5)

df

Now let's plot the graph


# set up colours and points
cols <- c("darkolivegreen3","darkcyan","mediumpurple2","coral3")
pch <- c(17,18,8,15)

# we have one point on X axis for each row of df (nrow(df))
# we then add 2.5 to make room for the legend
xmax <- nrow(df) + 2.5

# make the borders smaller
par(mar=c(3,3,0,0))

# plot an empty graph
plot(1:nrow(df), 1:nrow(df), pch="", 
		xlab=NA, ylab=NA, xaxt="n", yaxt="n", 
		ylim=c(0,35), bty="n", xlim=c(1,xmax))

# add horizontal lines
for (i in seq(0,35,by=5)) {
	lines(1:nrow(df), rep(i,nrow(df)), col="grey")
}

# add points and lines 
# for each dataset
for (i in 1:ncol(df)) {

	points(1:nrow(df), df[,i], pch=pch[i], 
		col=cols[i], cex=1.5)

	lines(1:nrow(df), df[,i], col=cols[i], 
		lwd=4)


}

# add bottom axes
axis(side=1, at=1:nrow(df), tick=FALSE, 
		labels=rownames(df))

axis(side=1, at=seq(-0.5,8.5,by=1), 
		tick=TRUE, labels=NA)

# add left axis
axis(side=2, at=seq(0,35,by=5), tick=TRUE, 
		las=TRUE, labels=paste(seq(0,35,by=5),"%",sep=""))

# add legend
legend(8.5,25,legend=colnames(df), pch=pch, 
		col=cols, cex=1.5, bty="n",  lwd=3, lty=1)

And here is the result:

excel_plot

Not bad eh?  Actually, yes, very bad; but also very Excel!

The multi-plot

Plotting multi-panel figures in R is sooooooo easy!  Here we go for the alternate multi-plot.  We use the same data.


# split into 2 rows and 2 cols
split.screen(c(2,2))

# keep track of which screen we are
# plotting to
scr <- 1

# iterate over columns
for (i in 1:ncol(df)) {

	# select screen
	screen(scr)

	# reduce margins
	par(mar=c(3,2,1,1))

	# empty plot
	plot(1:nrow(df), 1:nrow(df), pch="", xlab=NA, 
		ylab=NA, xaxt="n", yaxt="n", ylim=c(0,35), 
		bty="n")

	# plot all data in grey
	for (j in 1:ncol(df)) {
		lines(1:nrow(df), df[,j], 
		col="grey", lwd=3)

	}	

	# plot selected in blue
	lines(1:nrow(df), df[,i], col="blue4", lwd=4)

	# add blobs
	points(c(1,nrow(df)), c(df[1,i], df[nrow(df),i]), 
		pch=16, cex=2, col="blue4")

	# add numbers
	mtext(df[1,i], side=2, at=df[1,i], las=2)
	mtext(df[nrow(df),i], side=4, at=df[nrow(df),i], 
		las=2)	

	# add title
	title(colnames(df)[i])

	# add axes if we are one of
	# the bottom two plots
	if (scr >= 3) {
		axis(side=1, at=1:nrow(df), tick=FALSE, 
			labels=rownames(df))
	}

	# next screen
	scr <- scr + 1
}

# close multi-panel image
close.screen(all=TRUE)

And here is the result:

multi_plot

 


And there we have it.

So which do you prefer?

Plot your own EU referendum poll results

Due to the unspeakable horror of the EU referendum, I have to find something to make me feel better.  This poll of polls usually does so, though it is way too close for comfort.

Anyway, I took their data and plotted it for myself.  Data and script are on github, and all you need is R.

Enjoy! #voteRemain

voteRemain

Travel tips for highly organised and important people

Karthik Ram (@_inundata on Twitter) wrote down recently some of his “travel hacks“, and so I thought I would repeat the kind of sage advice he offers here. Here it is:

1. Luggage: choose the bag that smells the least bad. Remember that the cat likes to pee in them so check that first; also, the baby might have puked in there. The best bag is one that still contains all the stuff from your last trip that you can simply use again (to prevent the use of inappropriate clothes, simply never ever dress seasonally).  Also, pack at the very last minute, to make sure you don’t waste any precious time.

2. Power: go to the drawer that contains all of the screwdrivers, old keys and travel adapters and put every single one of the travel adapters you own in the bag.  Every. Single. One. Then buy another one at the airport.  Take the external battery you bought, but is never charged.  Load up at least 3 broken USB cables, and make sure to put it in your checked bag so you can’t use them in the airport, even though you really want to.  Simply buy a whole new set of chargers in every airport you go to.

3. Sleep: never, ever refuse the opportunity to have an espresso in the airport, no matter what time of day or night, so that you’re wired and sweating for the entire flight.  Do not watch sad movies.  Remember that sad things in movies come back to you late at night when you can’t sleep, and you imagine them happening to you and your family and you cry into your pillow (and that really didn’t work out that time in China when the pillow was made of rice).

4. WiFi: always run up at least a £100 roaming 3G bill on your personal mobile phone by checking your emails and Twitter constantly.  Never take your work mobile, it is two models out of date and soooo uncool.

5. Health: ensure you maintain your health by somehow sitting next to the person who has the worst cold you have ever witnessed.  Every. Single. Time.

6. Accessories: buy those tiny expensive toiletries at the airport and never claim them on expenses.  They’re so cute!  They look exactly like the grown up ones, but teeny – squee!

7. Apps: NO NEVER USE YOUR PHONE THE BATTERY WILL RUN OUT, STOP STOP STOP SWITCH IT OFF TO SAVE BATTERY.  YES I KNOW THE SEAT IN FRONT HAS A USB PORT BUT YOU DON’T HAVE THE RIGHT CABLE!

And with that, I will wish you a happy new year!

Mick

 

 

Finally, a definition for bioinformatics

Following the trend for ultra-short-but-to-the-point blog posts, I have decided to finally define bioinformatics:

bio-informatics (bi-oh-in-foh-shit-I-don’t-understand-how-that-works)

From the word bio, meaning “of or related to biology” and informatics, meaning “absolutely anything your collaborators or boss don’t understand about maths, statistics or computing, including why they can’t print and how the internet works”

Happy to finally put that one to bed!

101 bioinformatics facts

I am trying to crowd-source 101 fun bioinformatics facts!  Please contribute in the comments and I will add them below:

  1. BLAST is so fast, the authors had to deliberately slow down the code so it doesn’t overheat the servers
  2. GCG, the old bioinformatics package, was named after the authors kept high-fiving each other, shouting “good code guys!”
  3. Bowtie is named so because “it is almost impossible to tie”, referring to code to avoid a “race condition” when using multiple processors
  4. TopHat is named do because it was the first spliced RNA-Seq aligner, and when it worked first time, the authors shouted “Top that!”
  5. Over 1 billion people have searched the NCBI protein database for their own name
  6. The EBI is an elaborate front-end to NCBI services
  7. The SRA (short read archive) is the best known of the archives, and not many people know or use the MRA (medium read archive), the KLRA (kinda long read archive) and the LRA (long read archive)
  8. Europe PubMed Central has only ever been accessed by people accidentally clicking on links.  100% of visitors immediately bounce to pubmed.com
  9. There are now more journals than papers
  10. The HGAP assembler is actually an elaborate front-end hiding three thousand slave labourers all running GAP4 (via @IanGoodhead)
  11. HPC actually means “Homunculus Powered Computing”, and all servers are actually just mechanical turks full of leprechauns (via @froggleston)
  12. The biosemantics.org group of LUMC is doing psipred protein folding on 1kWh household radiators (128 cores each) https://vimeo.com/122893200 (via Eric Feliksik)
  13. The ‘p’ in p-value actually stands for p-otentially interesting! (via )
  14. Velvet is so named because @dzerbino wore velvet gloves when coding it (via @pathogenomenick)
  15. The @PacBio machines are so large because inside’s an Illumina machine + a bioinformatician running assemblies (via )
  16. The Cloud is actually just a cloud. That’s it. A real cloud (via @froggleston)
  17. If you plug in a @nanopore MinION and hit left,right,up,up,A,B,down, it’ll transform into a lifesize statue of @Clive_G_Brown (via @froggleston)
  18. DDBJ have their data centre in a volcano, and are basically a front for Osato Chemicals (SPECTRE) (via @SCEdmunds)
  19. Hidden Markov Models were initially developed to find Waldo shawnhymel.com/portfolio/413/ (via )
  20. 99.5% of people who cite Altschul et al have never read the paper
  21. Bioinformatics Applications Notes have to be automatically generated by the software they describe
  22. BGI exclusively publish in Nature journals because their papers are first rejected by Gigascience
  23. BGI actually only have one HiSeq but made to look like hundreds by a set up of mirrors, like that bit in Enter the Dragon (via @froggleston)
  24. the consumption rate of coffee (+ beer 🍻) among Bioinformaticians from around the world is increasing every year. TRUE FACT! (via @NazeefaFatima)
  25. EBI” actually stands for “European bureau of investigation”. It’s a front of the EU secret service, collecting genomic info (via @klmr)
  26. There are only 3 facts in 101 (via @mcaccamo)
  27. If all you have is a hmmer everything looks like it can be resolved with Viterbi (via @mcaccamo)
  28. Hidden Markov Models are like the recipe for Kentucky Fried Chicken.  There are only three people in the world who understand small parts of how HMMs work, and only when they get together do they know the full picture
  29. The “e” in e-value stands for “excellent”, as in “that’s an excellent BLAST hit”
  30. The Burrows-Wheeler transform, used in BWA and Bowtie, saves memory by transforming the DNA sequence data into a parallel dimension, meaning it ceases to exist in 4D space/time in this Universe
  31. Base qualities are called “Phred” scores in honour of Fred Sanger who developed DNA sequencing. #101bioinfofunfacts (via @tostenseemann)
  32. In a recent public survey of the 100 most desirable jobs, bioinformatician was a close second to astronaut (via @dynomics)
  33. Heng Li writes all his code in x86 assembly language, and uses a C decompiler before releasing it. @lh3lh3 (via @torstenseemann)
  34. The EBI secretly funds the Perl Foundation to ensure its legacy internal software infrastructure won’t collapse (via @torstenseemann)
  35. Illumina reads are short as before the development of Basespace they were delivered via Twitter (via @RoyChaudhuri)
  36. Pet Bioinformaticians are paid with cuddling #101bioinfofunfacts (via )
  37. Python was conceived in the 1980’s by @gvanrossum & named after his favourite British comedy, Monty Python’s Flying Circus (via )
  38. the word “ELVIS” appears 35 times in human peps (GRch38). “ELVISLIVES” appears 0 times. The king has left the genome #slowday (via @rdemes)
  39. Tuxedo suit is so named that only ‘privileged’ know how to use it ! #bioinformaticsfun (via )
  40. It’s easy! You only have to download this database in which all the genes have only one ID and you can retrieve the IDs in the most important databases (via @jorjial)
  41. If you stand in front of a mirror and say ‘HiSeq’ 3 times, Illumina staff member will show up holding the HiSeq X Ten system (via @nazeetafatima)
  42. @BenLangmead wrote Bowtie while wearing a tuxedo but he did all the testing in zip-up onesie batman pajamas (via @coletrapnell)
  43. Spike-ins are like gold (via @nomad421)
  44. Do you need more hard disk space to store and do the analysis? Sure! Let’s buy 10 hard disk of 3 TB in the supermarket (via @jorjial)
  45. This could be the basis for 10.1 papers in PLOS Comp. Biol. (via @kbradnam)
  46. All bioinformatics problems can be solved through the medium of twitter, snide and ranting 😉 (via @guyleonard)
  47. Installing TopHat with option –reverse will install HotTap, a program that spews vapid results on a random science hot topic (via @CamLBerthelot)
  48. SOLiD sequencers generated colour-space sequence using an algorithm based on the once popular “Simon Says” hand held game (via @iandcalling)
  49. CriMap was called CriMap because users do an awful lot of crying before they get a half decent map (via @dj_de_koning)
  50. A single anonymous donor, RP11, accounts for 72 percent of the human reference genome (via CanGenom)
  51. If you amass the de-bugging tears of a bioinformatician it is enough to fill an Olympic size swimming pool annually (via @paulhoskisson)
  52. FASTA 80 character line wrapping was invented to standardise data sharing using MS Word (via @IanGoodhead)
  53. nine out of ten Bioinformaticians prefer Excel (via
  54. if you’ve never shown the NIH sequencing costs plot in talk/lecture you’re not a real bioinformatician pic.twitter.com/jQzG7MGosd (via @AliciaOshlack)
  55. Illumina is short for Illuminati, the shadowy organisation that controls sequencing worldwide. (via @neilfws)
  56. Every time you run a closed source bioinformatics tool, a PhD student’s soul is sacrificed to the Blood God. (via @froggleston)
  57. The number of replicates needed for your RNA-seq experiment equals the impact factor of the journal you want to publish in (via @torstenseemann)
  58. NCBI’s bacterial annotation takes 6 weeks because it’s done manually by work experience students pasting ORFs into web BLAST (via @torstenseemann)
  59. The majority of bioinformaticians can’t pronounce “de Bruijn” properly (see also thegenomefactory.blogspot.sg/2013/08/how-to… @torstenseemann) (via @rvaerle)
  60. Oxford Nanopore plans to introduce a new FASTQ encoding scheme using an ASCII offset of 48 with optional emoji (via @torstenseemann)
  61. The HMMer package was so named when someone asked how it worked, and the developers said “Hmmmm… errr….” (via @mgollery)
  62. 63% of Bioinformaticists were Biologists to start with, but they realized that the cold room is really COLD! (via @mgollery)
  63. It has been calculated that there are twice as many data formats as there are Bioinformaticians (via @mgollery)

Analysing MinION data with no bioinformatics expertise nor infrastructure

It has been a long standing aim of mine to enable easy analysis of Oxford Nanopore MinION data, preferably by people who don’t have a ton of bioinformatics skills.  This was one of the reasons we started to develop poRe, which is really easy to install, runs on any platform and is fairly easy to use.

Below is a proof-of-concept method for analysing MinION data without the need for bioinformatics infrastructure – I am doing this on my Windows laptop, with only R and an internet connection.  We are going to use the EBI web-services to do the heavy lifting.  Please do not abuse this!  EBI probably don’t want you to submit 20,000 jobs.  As I said, this is “proof-of-concept”.  It means you’re not supposed to actually use it in practice 😉

Some polite things to do:

  1. Use your own, real e-mail address when/if you run the code below
  2. Please stick to a small number of jobs! (unless EBI tell you it’s OK)

We’re going to do everything in R.

Make a directory on your computer where we will do the work.  You will need approx. 15Gb of space (I know, the data are big!).   My directory is going to be “C:/Users/Mick/Documents/MinION”.  We’re going to use data from Nick Loman’s group, published here and here.

# go to the working directory
setwd("C:/Users/Mick/Documents/MinION")

# load the necessary packages
library(poRe)
library(RCurl)

# download one of Nick's runs and unpack it
url <- "ftp://ftp.sra.ebi.ac.uk/vol1/ERA362/ERA362836/oxfordnanopore_native/Ecoli_R7_ONI_flowcell_17.tar.gz"
dest <- "Ecoli_R7_ONI_flowcell_17.tar.gz"
download.file(url=url, destfile=dest)
untar("Ecoli_R7_ONI_flowcell_17.tar.gz")

This will take some time, as there is about 10Gb of data there.  If I had a smaller dataset, I would use it.

Now, we use the poRe get_fast5_info() function to read statistics from the several thousand MinION reads, which again takes some time.  We then just select the reads that have a 2D sequence:

# read information from all .fast5 files in the current working directory
finfo <- read.fast5.info(dir=getwd())

# get a list of fast5 files where the "len2d" is greater than zero
reads.2d <- rownames(finfo)[finfo$len2d>0]

Now, we are going to use the EBI REST API for NCBI BLAST.  You can read about it at the link provided, but essentially it is about building and querying custom URLs to submit BLAST jobs and fetch the result.

We are only going to do this for the first ten fast5 sequences.  Please do not submit more, or the EBI may block you.  If you want to use this in anger, I suggest contacting the EBI directly and asking them politely.  The code blasts each sequence with default parameters against the EMBL bacterial database, saves the BLAST report to your current working directory, and prints out the top hit.

Here goes:

# iterate over the first ten 2D reads
for (f5 in reads.2d[1:10]) {

    # use poRe function get_fasta() to get the 2D sequence
    	fasta.2d <- get_fasta(f5)$'2D'
	
    # build a list of parameters for the EBI BLAST service
    	myl <- list(sequence=fasta.2d,
                stype="dna",
		                program="blastn",
		                email="yourownaddress@email.com",
		                database="em_rel_pro")

    # build the submission URL and submit the job
    runurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/run"
    	jobid <- postForm(runurl, .params=myl)

    # query the status URL for the job until the status
    # is no longer "RUNNING"
    staturl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/status/"
    	surl <- paste(staturl, jobid, sep="")
    	status <- "RUNNING"
    	while(status == "RUNNING") {
          		status <- getURL(surl)
	    }

    # build the output URL and fetch the results
    outurl <- "http://www.ebi.ac.uk/Tools/services/rest/ncbiblast/result/"
    	ourl <- paste(outurl, jobid, "/out", sep="")
    	out <- getURL(ourl)

    # write the output to a file in the current working directory
    	bloutfile <- gsub(".fast5",".blast",f5)
    	write(out, bloutfile)

    # print out the top hit
    	print(strsplit(out, "n")[[1]][22])

}

Now, it’s not quick – it probably takes about 2-3 minutes to complete on my laptop.  It also isn’t very responsive (in that it looks a bit like it may not be doing anything!).  However, eventually the R terminal will show the top hits:

blast_top_hit

You can see from the above that the first few reads are a bit dodgy and don’t really hit anything, but pretty quickly the reads start hitting E coli K12, which is what they sequenced.

Also, in your directory you will find a whole bunch of BLAST reports:

blast_files

Note, as “proof-of-concept” code, there is very little (i.e. nothing) in the way of error-checking in the above code!  I mean, come on, it’s only a blog post!

Enjoy!

Excited by bacterial DNA in sweet potatoes? Check this out!

Cool paper – turns out that the genome of the sweet potato has segments of the Agrobacterium genome in it! Genuinely cool stuff.

However, I found something even more exciting!  A tiny virus genome present in huge numbers of sequenced bacteria.  Check it out:

  1. GO here: http://blast.ncbi.nlm.nih.gov/Blast.cgi
  2. Click on Nucleotide Blast
  3. Where it says “Enter accession number, gi or FASTA”, simply enter the number: 9626372
  4. Where it says “Database” make sure you choose “others”
  5. Just under there, where it says “Organism (optional)” start typing the word Bacteria, and choose “Bacteria (taxid: 2)”
  6. Click the Blast button at the bottom

OMG!  This tiny viral genome is in E coli, it’s in Acinetobacter, Desulfitobacterium, Sphingobacterium….

I’M EMBARRASSED FOR THEM! #SHAME

Data from “The top 50 science stars of Twitter”

So I scraped the data from Science’s article “The top 50 science stars of Twitter“, and it is available here.

Have fun with it!

citations_followers

Genomics: prepare for the quacks!

Today is April fool’s day, a day when we traditionally put up joke stories in an attempt to fool other people.  I thought I’d turn this round and report on things that should be April fool’s stories but aren’t.

In keeping with the subject of this blog, I wanted to look at what will happen when everyone, when all of you, own your own genome data.  And what will happen is that there will be a new economy, 100s of new companies just waiting to con the stupid out of their money:

  1. “Love is no conincidence” – give these guys your genome data and they will find your perfect partner
  2. Too fat or unhealthy?  You’re eating the wrong food and the answer is encoded in your genome
  3. Hate the gym, but can’t play sports?  Don’t worry, we can provide your personalised genomic fitness plan
  4. Found the perfect genomic partner using genepartners.com?  Want to know how long they’ll live?  Easy!
  5. Have you always wanted to know how Yoga affects your genome?  No?  Well, if you do, “mind-body genomics” is a thing now.
  6. Do you want to improve your health?  Who doesn’t?  Well, why not just eat some DNA?
  7. Struggling to understand homeopathy?  It’s epigenetic, you clown!

To genomicists, scientists, rationalists and anyone else who uses an evidence-based philosophy to understand the World, I say this: prepare for the quacks.

Older posts

© 2018 Opiniomics

Theme by Anders NorenUp ↑