Opiniomics

bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Category: Uncategorised

Come and do a PhD with us!

Come and do a PhD in my group, in collaboration with our amazing partners!

Please apply using University of Edinburgh links in the above adverts.

Below are all the reasons why you MUST COME AND STUDY WITH US! These are the kind of thing we can’t fit in the adverts. Please apply, even if you think you don’t yet have the skills – I am sure we can work on that for the right person.

(notes below refer to my lab at The Roslin Institute! I am sure the labs of the other supervisors are cool too!)

1. Institutional environment

You will join my lab at The Roslin Institute, a world-famous research institute in Edinburgh. We carry out world class research in biotechnology, and are now fully embedded within the University of Edinburgh, consistently ranked in the top 20 universities worldwide.

The institute is housed in a brand new building put up in 2011, and is across the road from another new building, the Royal (Dick) Vet School. We have two restaurants on site, and a gym. The site is served by buses several times a day, and it really is a pleasant place to work. We’re just a little outside the city centre and the views of the local countryside are stunning.

Roslin has a very relaxed atmosphere. There are over 100 PhD students,  and we also have a lively set of post-docs. This means it is easy to make friends and there is always something social going on – for example, Friday is “cake day”, and if you join the cake club you can go and enjoy someone’s cake in the restaurant with all of the other members. You can read the post-doc handbook here!

Honestly, it’s a fantastic place to work and I think you’d love it.

2. Group environment

My aim is to run a relaxed, supportive, open and inclusive group, with a focus on helping individuals realise their objectives, whatever they may be. We are LGBTQ+ friendly and we are family-friendly!

We have a weekly lab meeting where we talk about science and people are encouraged to talk about any problems/issues they have, or whatever they have been thinking about (this is not mandatory though, for those of you who are introverts). My door is always open for one-to-one meetings.

My attitude is that I don’t expect you to succeed, but I do expect you to try. Failure is perfectly fine – we can deal with that and I, or other members of the group, can help. All I expect of you is to put the effort in and try. As long as I see sufficient effort on the project you are working on, as long as there is progress, and as long as you make up the time somehow, I am flexible about how, where and when you work.

I encourage group members NOT to work too hard and to take all of their annual leave. If you need flexible working I am 100% OK with that; need Fridays off? No problem. Need to leave every day at 3 to collect your kids? No problem. Want to work from home? No problem. Having two young children myself, I am particularly sensitive to the needs of young families, and I am also committed to helping young parents get back into work. I also have no age bias – please, if you are interested in one of our positions, apply!

Of course, it’s good for you to come in and see the other members of the group and to see me, but I am relaxed about this, and will leave it up to you to decide.

3. Research

Our work largely involves generation and analysis of huge amounts of sequence data, either for functional genomics or metagenomics/microbiome. As a lab I encourage everyone to use Snakemake, which has conda and Docker support built in for dealing with software installation issues.  We have access to a 4000+ core University high performance computing system, and we have begun working with cloud technologies such as Google Cloud Platform and Kubernetes. Most of the group code in Python, though there is a significant amount of R and Perl hanging around. I am an R guru, if you’d like to learn it 😉

If you don’t think you have the skills, but would like to learn them (by doing!) then apply anyway – if you’re the right person, we can fill in the technical skills later. Honestly, the technical skills might be the least important. As long as you are comfortable in Linux, everything else can be learned quickly.

Our focus is on discovery biology fuelled by reproducible science. What can we discover that is novel about biological systems using computational biology? We also write and maintain novel software workflows to ensure that (i) our science is reproducible and (ii) others can follow our work and apply it to their own problems.

The group is always full of ideas of what we can do, and we try to keep ourselves at the cutting edge of technology.

4. Edinburgh

Edinburgh is one of the world’s most beautiful cities, with historical buildings, huge green open spaces, and a long golden beach (yes really!). There are too many attractions to list them here, but needless to say, it is an incredible place to live and you will never be short of things to do here. Living expenses are affordable compared to many other cities in the UK and Europe, and decent accommodation at all levels can be found throughout the city; alternatively, many people at Roslin choose to settle just outside the city, in one of the many local villages and towns. There are plenty of options!

5. Funding

Unfortunately I cannot change the academic funding model of the UK single-handedly, and so many of the posts in the group are 2 or 3 year contracts. However, what I can say is that we have a track record of success in funding and personally I have been consistently funded since 2008, and have current grants that run to 2022 – that will be 14 years of funding from various research agencies with no break. I am confident that I will have positions available for you when you run to the end of your contract, should you wish to stay; if not, there are hundreds of opportunities elsewhere in the University, for all sorts of career paths; and good biologists are always in demand.

6. Travel

Generally speaking I support travel and attendance at conferences and workshops (within reason, and within budget). Attendance at the UK’s Genome Science conference, which I help organise, is always encouraged and many of the team members attend PAG in San Diego every January. We have an active project with collaborators in Nairobi too, and so travel there is also a possibility. However, if you don’t wish to travel, or can’t, for whatever reason, that’s also fine and we can always discuss the best way to handle meetings that are necessary for the project you’re working on.

That’s it! We have a great group, and we are doing fun and important things. Please come!

Mick

University of California makes legal move against Roger Chen (and Genia)

The relationship between sequencing companies is frosty and beset by legal issues, which I’ve covered before here and here.  Keith Robison tends to cover in more detail 😉

Most recently, PacBio moved against Oxford Nanopore, we think claiming that ONT’s 2D technology violated their patent on CCR (link).

Well now the absolute latest is a filing by the University of California against Roger Chen and therefore Genia. If you click through to the documents (requires registration) you’ll see that UC claim Chen, with others, produced key inventions whilst at UC that he later assigned to Genia, but which should have automatically been assigned to UC according to UC’s “oath of allegiance”, which Chen signed as a UC employee.

It remains to be seen how important this is and no doubt Chen/Genia/Roche will fight tooth and nail; however if the courts decide in UC’s favour it could spell the end of Genia, and at the very least see a large cash settlement with UC.

Fascinating times!

People are wrong about sequencing costs on the internet again

People are wrong about sequencing costs on the internet again and it hurts my face, so I had to write a blog post.

Phil Ashton, whom I like very much, posted this blog:

But the words are all wrong 😉  I’ll keep this short:

  • COST is what it COSTS to do something.  It includes all COSTS.  The clue is in the name.  COST. It’s right there.
  • PRICE is what a consumer pays for something.

These are not the same thing.

As a service provider, if the PRICE you charge to users is lower than your COST, then you are either SUBSIDISED or LOSING MONEY, and are probably NOT SUSTAINABLE.

COST, amongst other things, includes:

  • Reagents
  • Staff time
  • Capital cost or replacement cost (sequencer and compute)
  • Service and maintenance fees
  • Overheads
  • Rent

Someone is paying these, even if it’s not the consumer.  So please – when discussing sequencing – distinguish between PRICE and COST.
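
To make the distinction concrete, here is a minimal shell sketch. Every number in it is made up for illustration (these are not real quotes from any facility), and `full_cost` is a hypothetical helper name; the only point is that COST is the sum of all the components listed above, while PRICE is a separate number.

```shell
#!/usr/bin/env bash
# Hypothetical per-run figures, invented purely for illustration.
reagents=1000
staff=400
capital=300      # sequencer and compute, spread across runs
service=150
overheads=100
rent=50
price=1200       # what the user is actually charged

# COST is the sum of every component, whoever ends up paying for it.
full_cost() {
    echo $(( reagents + staff + capital + service + overheads + rent ))
}

cost=$(full_cost)
shortfall=$(( cost - price ))

echo "COST per run:  ${cost}"
echo "PRICE per run: ${price}"
if [ "${shortfall}" -gt 0 ]; then
    echo "Shortfall of ${shortfall} per run: subsidised or losing money"
else
    echo "PRICE covers COST"
fi
```

With these invented figures the PRICE sits well below the COST, so somebody other than the consumer is covering the difference.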

Thank you 

The unbearable madness of microbiome

This is my attempt to collate the literature on how easy it is to introduce bias into microbiome studies.  I hope to crowd-source papers and add them below under each category.  PLEASE GET INVOLVED.  If this works well we can turn it into a comprehensive review and publish 🙂 Add new papers in the comments or Tweet me 🙂

Special mention to these blogs:

  1. Microbiome Digest’s page on Sample Storage
  2. Microbe.net’s Best practices for sample processing and storage prior to microbiome DNA analysis freeze? buffer? process?
  3. The-Scientist.com’s Spoiler Alert

****************************************

UPDATE 7th August 2016

****************************************

For posterity the original blog post is below, but I am now trying to manage this through a ZOTERO GROUP LIBRARY.  Please continue to contribute – join the group, get involved!  I think you can join the group here.  I am trying to avoid just listing software papers, BTW; I would prefer to focus on papers that specifically demonstrate sources of bias.

I won’t be updating the text below.


General

  • “We propose that best practice should include the use of a proofreading polymerase and a highly processive polymerase, that sequencing primers that overlap with amplification primers should not be used, that template concentration should be optimized, and that the number of PCR cycles should be minimized.” (doi:10.1038/nbt.3601)

Sample collection and sample storage

  • “Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples” (24161897)
  • “No significant differences occurred in the culture-based analysis between the fresh, snap or -80°C frozen samples.” (25748176)
  • “In seven of nine cases, the Firmicutes to Bacteroidetes 16S rRNA gene ratio was significantly higher in fecal samples that had been frozen compared to identical samples that had not….. The results demonstrate that storage conditions of fecal samples may adversely affect the determined Firmicutes to Bacteroidetes ratio, which is a frequently used biomarker in gut microbiology.” (22325006)
  • “Our results indicate that environmental factors and biases in molecular techniques likely confer greater amounts of variation to microbial communities than do differences in short-term storage conditions, including storage for up to 2 weeks at room temperature” (10.1111/j.1574-6968.2010.01965.x)
  • “The trichloroacetic acid preserved sample showed significant loss of protein band integrity on the SDS-PAGE gel. The RNAlater preserved sample showed the highest number of protein identifications (103% relative to the control; 520 ± 31 identifications in RNAlater versus 504 ± 4 in the control), equivalent to the frozen control. Relative abundances of individual proteins in the RNAlater treatment were quite similar to that of the frozen control (average ratio of 1.01 ± 0.27 for the 50 most abundant proteins), while the SDS-extraction buffer, ethanol, and B-PER all showed significant decreases in both number of identifications and relative abundances of individual proteins.” (10.3389/fmicb.2011.00215)
  • “Optimized Cryopreservation of Mixed Microbial Communities for Conserved Functionality and Diversity” (10.1371/journal.pone.0099517)
  • “A previously known bias in FTA(®) cards that results in lower recovery of pure cultures of Gram-positive bacteria was also detected in mixed community samples. There appears to be a uniform bias across all five preservation methods against microorganisms with high G + C DNA. Overall, the liquid-based preservatives (DNAgard(™), RNAlater(®), and DESS) outperformed the card-based methods.” (22974342)
  • “Microbial composition of frozen and ethanol samples were most similar to fresh samples. FTA card and RNAlater-preserved samples had the least similar microbial composition and abundance compared to fresh samples.” (10.1016/j.mimet.2015.03.021)
  • “Bray-Curtis dissimilarity and (un)weighted UniFrac showed a significant higher distance between fecal swabs and -80°C versus the other methods and -80°C samples (p<0.009). The relative abundance of Ruminococcus and Enterobacteriaceae did not differ between the storage methods versus -80°C, but was higher in fecal swabs (p<0.05)” (10.1371/journal.pone.0126685)
  • “We experimentally determined that the bacterial taxa varied with room temperature storage beyond 15 minutes and beyond three days storage in a domestic frost-free freezer. While freeze thawing only had an effect on bacterial taxa abundance beyond four cycles, the use of samples stored in RNAlater should be avoided as overall DNA yields were reduced as well as the detection of bacterial taxa.” (10.1371/journal.pone.0134802)
  • “A key assumption in many studies is the stability of samples stored long term at −80 °C prior to extraction. After 2 years, we see relatively few changes: increased abundances of lactobacilli and bacilli and a reduction in the overall OTU count. Where samples cannot be frozen, we find that storing samples at room temperature does lead to significant changes in the microbial community after 2 days.” (10.1186/s40168-016-0186-x)

DNA extraction

  • “Caution should be paid when the intention is to pool and analyse samples or data from studies which have used different DNA extraction methods.” (27456340)
  • “Samples clustered according to the type of extracted DNA due to considerable differences between iDNA and eDNA bacterial profiles, while storage temperature and cryoprotectants additives had little effect on sample clustering” (24125910)
  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences when mechanical disruption (bead-beating) procedures for DNA extraction were employed together with optimised “universal” PCR primers” (26120470)
  • Qiagen DNA stool kit is biased (misses bifidobacteria)  (26120470)
  • “Bead-beating has a major impact on the determined composition of the human stool microbiota.  Different bead-beating instruments from the same producer gave a 3-fold difference in the Bacteroidetes to Firmicutes ratio” (10.1016/j.mimet.2016.08.005)
  • “We observed that using different DNA extraction kits can produce dramatically different results but bias is introduced regardless of the choice of kit.” (10.1186/s12866-015-0351-6)

Sequencing strategy

  • “Bifidobacteria were only well represented among amplified 16S rRNA gene sequences … with optimised “universal” PCR primers. These primers incorporate degenerate bases at positions where mismatches to bifidobacteria and other bacterial taxa occur” (26120470)
  • Anything other than 2x250bp sequencing of V4 region (approx 250bp in length) inflates number of OTUs (23793624)
  • “This study demonstrates the potential for differential bias in bacterial community profiles resulting from the choice of sequencing platform alone.” (10.1128/AEM.02206-14)
  • “The effects of DNA extraction and PCR amplification for our protocols were much larger than those due to sequencing and classification” (10.1186/s12866-015-0351-6)
  • “Nested PCR introduced bias in estimated diversity and community structure. The bias was more significant for communities with relatively higher diversity and when more cycles were applied in the first round of PCR” (10.1371/journal.pone.0132253)
  • “pyrosequencing errors can lead to artificial inflation of diversity estimates” (19725865)
  • “Our findings suggest that when alternative sequencing approaches are used for microbial molecular profiling they can perform with good reproducibility, but care should be taken when comparing small differences between distinct methods” (25421243)

Data analysis strategy

 

Bioinformatics / database issues

 

Contamination in kits

  • “Reagent and laboratory contamination can critically impact sequence-based microbiome analyses” (25387460)
  • “Due to contamination of DNA extraction reagents, false-positive results can occur when applying broad-range real-time PCR based on bacterial 16S rDNA” (15722157)
  • “Relatively high initial densities of planktonic bacteria (10(2) to 10(3) bacteria per ml) were seen within [operating ultrapure water treatment systems intended for laboratory use]” (8517737)
  • “Sensitive, real-time PCR detects low-levels of contamination by Legionella pneumophila in commercial reagents” (16632318)
  • “Taq polymerase contains bacterial DNA of unknown origin” (2087233)

 

Commercial stuff

Not providing feedback on rejected research proposals is a betrayal of PIs, especially early career researchers

Those of you who follow UK research priorities can’t have missed the creation of the Global Challenges Research Fund (GCRF).  Over the last few months the research councils in the UK have been running the first GCRF competition, a two stage process where preliminary applications are sifted and only certain chosen proposals are allowed to go through to the second stage.   We put one in, and like many others didn’t make the cut.  I don’t mind this, rejection is part of the process; however I do worry about this phrase in the rejection email:

Please note specific feedback on individual outlines will not be provided.

Before I go on, I want to launch an impassioned defense of the research councils themselves.  Overall I think they do a very good job of a very complex task.  They must receive many hundreds of applications, perhaps thousands, every year and they ensure each is reviewed, discussed at committee and a decision taken.  Feedback is provided and clearly UK science punches above its weight in global terms, so they must be doing something right.  They are funding the right things.

I’m also aware that they have had their budgets cut to the bone over the last decade and by all accounts (anecdotal so I can’t provide links) Swindon office has been cut to the bare minimum needed to have a functional system.  In other words, in the face of cuts they have protected research spending.  Good work 🙂

I kind of knew GCRF was in trouble when I heard there had been 1400 preliminary applications.  £40M pot with expected grants of £600k means around 66 will be funded.  That’s quite a sift!
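
The back-of-the-envelope sum can be sketched in shell, using only the figures quoted above (the variable names are assumptions, chosen for illustration):

```shell
#!/usr/bin/env bash
pot=40000000          # the £40M pot
grant=600000          # expected grant size, £600k
applications=1400     # preliminary applications received

# Integer division: how many whole grants fit in the pot.
funded=$(( pot / grant ))

# Success rate as a percentage, to one decimal place.
rate=$(awk -v f="${funded}" -v a="${applications}" 'BEGIN { printf "%.1f", 100 * f / a }')

echo "Fundable grants: ${funded}"
echo "Success rate: ${rate}%"
```

That works out at roughly one application in twenty getting funded.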

The argument will go that, with that sheer number of applications there is no way the research councils can provide feedback.  Besides, it was a preliminary application anyway, so it matters less.

I couldn’t disagree more, on both counts.

First of all let’s deal with the “preliminary” application thing.  Here is what had to happen to get our preliminary application together:

  • Initial exchange of ideas via e-mail, meetings held, coffee drunk
  • Discussions with overseas collaborators in ODA country via skype
  • 4-page concept note submitted to Roslin Science Management Group (SMG)
  • SMG discussed at weekly meeting, feedback provided
  • Costings form submitted and acted on by Roslin finance
  • Quote for sequencing obtained from Edinburgh Genomics
  • Costings provided by two external partners, including partner in ODA country
  • Drafts circulated, commented on, acted on.
  • BBSRC form filled in (almost 2000 words)
  • Je-S form filled in, CV’s gathered, formatted and attached, form submitted

In actual fact this is quite a lot of work.  I wouldn’t want to guess at the cost.

Do we deserve some feedback?  Yes.   Of course we do.

When my collaborators ask me why this was rejected, what do I tell them?  “I don’t know”?  Really?

Secondly, let’s deal with the “there were too many applications for us to provide feedback” issue.  I have no idea how these applications were judged internally.  I am unsure of the process.  However, someone somewhere read it; they judged it; they scored it; forms were filled in; bullet points written; e-mails sent; meetings had; a ranked list of applications was created; somewhere, somehow, information about the quality of each proposal was created – why can we not have access to that information?  Paste it into an e-mail and click send.  I know it takes a bit of work, but we put in a lot of work too, as did 1400 other applicants.  We deserve feedback.

At the moment we just don’t know – was the science poor?  Not ODA enough?  Not applied enough?  Too research-y?  Too blue sky?  Wrong partners?  We are floundering here and we need help.

Feedback to failed proposals is essential.  It is essential for us to improve, especially for young and mid-career researchers, the ones who haven’t got secure positions, who are being judged on their ability to bring in money.  Failure is never welcome, but feedback always is.  It helps us understand the decision making processes of committees so we can do better next time.  We are always being told “request feedback” so when it doesn’t come it feels like a betrayal.  How can we do better if we don’t know how and why we failed?

So come on research councils; yes you do a great job; yes you fund great science, in the face of some savage budget cuts.  But please, think again on your decision to not provide feedback.  We need it.

Open analysis of ZiBRA project MinION data

The ZiBRA project aims to travel around Brazil, collecting Zika samples as they go and sequencing them in “real time” using Oxford Nanopore’s MinION.  They released the first data yesterday, and I have put together some quick scripts to analyse that data and produce a consensus sequence.

First we get the data:


# get Zika reference
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_012532.1&rettype=fasta&retmode=txt" > zika_genome.fa

# get the data
wget -q http://s3.climb.ac.uk/nanopore/Zika_MP1751_PHE_Long_R9_2D.tgz

# surprise, it's not zipped
tar xvf Zika_MP1751_PHE_Long_R9_2D.tgz

Extract 2D FASTQ/A, calculate read lengths, extract metadata:


# extract 2D FASTQ
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.2D.fastq 
extract2D Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.2D.fastq 

# convert to FASTA
porefq2fa zika.pass.2D.fastq > zika.pass.2D.fasta
porefq2fa zika.fail.2D.fastq > zika.fail.2D.fasta

# read lengths (the sequence is the second line of each four-line FASTQ record)
awk 'NR % 4 == 2 {print length($0)}' zika.pass.2D.fastq > zika.pass.2D.readlengths.txt

# extract metadata
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/pass/ > zika.pass.meta.txt
extractMeta Zika_MP1751_PHE_Long_R9_2D/downloads_rnn_wf839/fail/ > zika.fail.meta.txt
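
The read lengths file can then be summarised quickly. This is a rough sketch rather than part of the original scripts: `read_length_stats` is a hypothetical helper, and it assumes one read length per line on standard input, which is the layout of `zika.pass.2D.readlengths.txt`. N50 here means the length of the read at which the running total, counted from the longest read downwards, first reaches half of all bases.

```shell
#!/usr/bin/env bash
# read_length_stats: hypothetical helper, reads one length per line on stdin.
read_length_stats() {
    # Sort longest first so the N50 can be found with a running total.
    sort -n -r | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "reads=0"; exit }
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= half) { n50 = len[i]; break }
            }
            printf "reads=%d total=%d mean=%.1f n50=%d\n", NR, total, total / NR, n50
        }'
}

# Tiny made-up example
printf '%s\n' 100 200 300 400 | read_length_stats
# prints: reads=4 total=1000 mean=250.0 n50=300
```

On the real data this would be run as `read_length_stats < zika.pass.2D.readlengths.txt`.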

Align to reference genome using BWA


# bwa index the genome
bwa index zika_genome.fa

# samtools index the genome
samtools faidx zika_genome.fa

# run bwa and pipe straight to samtools to create BAM
bwa mem -M -x ont2d zika_genome.fa zika.pass.2D.fastq | \
        samtools view -T zika_genome.fa -bS - | \
        samtools sort -T zika.pass_vs_zika_genome.bwa -o zika.pass_vs_zika_genome.bwa.bam -

samtools index zika.pass_vs_zika_genome.bwa.bam
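
A quick sanity check on the alignment is to look at coverage. The sketch below is not part of the original scripts: `coverage_summary` is a hypothetical helper, and it assumes the three-column, tab-separated output of `samtools depth -a` (reference name, position, depth) arriving on standard input.

```shell
#!/usr/bin/env bash
# coverage_summary: hypothetical helper for `samtools depth -a` style input.
coverage_summary() {
    awk '
        { sum += $3; if ($3 > 0) covered++ }
        END {
            if (NR == 0) { print "positions=0"; exit }
            printf "positions=%d mean_depth=%.1f covered=%.1f%%\n", NR, sum / NR, 100 * covered / NR
        }'
}

# Tiny made-up example: four positions, one of them with no reads
printf 'ref\t1\t10\nref\t2\t0\nref\t3\t20\nref\t4\t30\n' | coverage_summary
# prints: positions=4 mean_depth=15.0 covered=75.0%
```

On the real alignment this would be something like `samtools depth -a zika.pass_vs_zika_genome.bwa.bam | coverage_summary`.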

Align to reference genome using LAST


# create a LAST db of the reference genome
lastdb zika_genome zika_genome.fa

# align high quality reads to reference genome with LAST
lastal -q 1 -a 1 -b 1 zika_genome zika.pass.2D.fasta > zika.pass_vs_zika_genome.maf

# convert the MAF to BAM with complete CIGAR (matches and mismatches)
python nanopore-scripts/maf-convert.py sam zika.pass_vs_zika_genome.maf | \
    samtools view -T zika_genome.fa -bS - | \
    samtools sort -T zika.pass_vs_zika_genome.last -o zika.pass_vs_zika_genome.last.bam -

samtools index zika.pass_vs_zika_genome.last.bam

Count errors using Aaron Quinlan’s excellent script:


# count errors
python nanopore-scripts/count-errors.py zika.pass_vs_zika_genome.last.bam > zika.pass_vs_zika_genome.last.txt

Produce a consensus sequence:


# produce consensus
samtools mpileup -vf zika_genome.fa zika.pass_vs_zika_genome.bwa.bam | bcftools call -m -O z - > allsites.vcf.gz
bcftools index allsites.vcf.gz
bcftools consensus -f zika_genome.fa allsites.vcf.gz > zika.minion.consensus.fasta
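
It can be worth checking the consensus against the reference before going further. This is a hedged sketch, not part of the original scripts: `fasta_summary` is a hypothetical helper that counts total bases and N characters in a FASTA file, or on standard input if no file is given.

```shell
#!/usr/bin/env bash
# fasta_summary: hypothetical helper, prints total bases and number of Ns.
fasta_summary() {
    awk '
        /^>/ { next }                        # skip header lines
        {
            seq = toupper($0)
            bases += length(seq)
            n += gsub(/N/, "", seq)          # gsub returns the number of Ns removed
        }
        END { printf "bases=%d n=%d\n", bases, n }' "$@"
}

# Tiny made-up example
printf '>demo\nACGTN\nNNAC\n' | fasta_summary
# prints: bases=9 n=3
```

Running `fasta_summary zika_genome.fa` and `fasta_summary zika.minion.consensus.fasta` side by side should give broadly similar lengths, differing only where indels were called.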

And finally, produce some QC images:


# QC images
./scripts/qc.R

This is all in a GitHub repo; I’m pretty sure the results produced here aren’t going to be used for anything serious, but this serves as a simple, exemplar, self-contained and reproducible analysis.

Converting from GenBank format to FASTA using Microsoft Excel

Should be a popular one this 🙂

First of all we need an example.  I’m using R23456, which we can download from NCBI.  On that page, look towards the top-right, click “Send To”, choose “File”, leave format as “GenBank (full)” and click “Create File”.  This should download to your computer.

In Excel, click File -> Open, navigate to the folder you downloaded the GenBank sequence to, make sure “All files (*.*)” is chosen in the File Open dialogue box and choose the recently downloaded file (it is almost certainly called sequence.gb)

This should immediately open up the Excel “Text Import Wizard”.  Just hit Finish.  You should see something like:

[Screenshot: the GenBank file opened in Excel]

Now, scroll down to the sequence data, which in my version is in cells A55 to A61.  Select all of those cells.

Select the Data menu and then click “Text to columns”

[Screenshot: the “Text to columns” button on the Data menu]

This brings up another wizard, but again you can just click finish.  The sequence data is now in B55 through E61.

Now click in cell I55 and type “=B55&C55&D55&E55&F55&G55”.  Hit enter.  Now click on the little green square and drag down to I61 so we have sequence data for all of the cells.

Now we’re getting close!

Click cell A1 and then repeat the “Text to Columns” trick.

Now go back down to cell I54 and type “=">"&B1” and hit enter.  We now have something that looks a lot like FASTA!

Final step – select all cells in the range I54:I61, hit Ctrl-C, then hit the “+” at the bottom to add a new sheet, and in the new sheet hover over cell A1, right-click your mouse and choose Paste Special.

In the “Paste Special” dialogue click the radio button next to “Values” and hit OK.

Now click File -> Save As, navigate to a suitable folder, make sure “Save as type” is set to “Text (Tab delimited) (*.txt)”, give it a filename and hit Save.  Click “OK” and “Yes” to Excel’s next two questions.

Go look at the file, it’s FASTA!

[Screenshot: the resulting FASTA file]

Awesome!  Who needs bioinformaticians eh?

(and if you use this, please never, ever speak to me.  Thanks 🙂)
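
For anyone without Excel to hand, the same conversion can be sketched on the command line. This is a minimal sketch under stated assumptions, not a robust parser: `gb2fasta` is a hypothetical helper name, it takes the FASTA header from the second field of the LOCUS line (matching what the spreadsheet version ends up with in B1), and it treats everything between ORIGIN and `//` as sequence.

```shell
#!/usr/bin/env bash
# gb2fasta: hypothetical helper, converts GenBank flat file text to FASTA.
gb2fasta() {
    awk '
        /^LOCUS/  { id = $2 }                        # record name for the header
        /^ORIGIN/ { in_seq = 1; print ">" id; next } # sequence starts after ORIGIN
        /^\/\//   { in_seq = 0; id = ""; next }      # "//" ends the record
        in_seq {
            gsub(/[^A-Za-z]/, "")                    # drop position numbers and spaces
            if (length($0) > 0) print
        }' "$@"
}

# Tiny made-up record
printf 'LOCUS       DEMO1   12 bp\nORIGIN\n        1 acgtacgtac gt\n//\n' | gb2fasta
# prints:
# >DEMO1
# acgtacgtacgt
```

For the file downloaded above, that would be `gb2fasta sequence.gb > sequence.fasta`.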

© 2020 Opiniomics
