## Opiniomics

### bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"

Simulating the perfect scenario

Let’s get our hands dirty in R.  I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:

```
gmat <- data.frame(id=paste("G",1:10,sep=""),
length=as.integer(runif(10, min=200, max=10001)),
nmol=as.integer(runif(10, min=0, max=50001)),
stringsAsFactors=FALSE)

```

This is what my data look like

```> gmat
id length  nmol
1   G1   5432  3206
2   G2   7177 43621
3   G3   6189 48670
4   G4   8529   635
5   G5   2576 27919
6   G6   8462 33280
7   G7   4316 18735
8   G8   3600 41457
9   G9   4973 21935
10 G10   9742 26207
```

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene.  A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

```gmat\$sspace <- gmat\$length * gmat\$nmol
ssum <- sum(gmat\$sspace)
gmat\$p <- gmat\$sspace/ssum```

My data now look like this:

```> gmat
id length  nmol    sspace          p
1   G1   5432  3206  17414992 0.01098634
2   G2   7177 43621 313067917 0.19750063
3   G3   6189 48670 301218630 0.19002544
4   G4   8529   635   5415915 0.00341666
5   G5   2576 27919  71919344 0.04537072
6   G6   8462 33280 281615360 0.17765861
7   G7   4316 18735  80860260 0.05101114
8   G8   3600 41457 149245200 0.09415216
9   G9   4973 21935 109082755 0.06881546
10 G10   9742 26207 255308594 0.16106284
```

My p-values should sum to one, and they do:

```> sum(gmat\$p)
[1] 1```

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix.  Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:

`gmat\$nreads <- round(30000000 * gmat\$p)`

And I can now calculate my FPKM values:

`gmat\$fpkm <- (gmat\$nreads / (gmat\$length / 1000)) / 30`

My data now look like this:

```> gmat
id length  nmol    sspace          p  nreads       fpkm
1   G1   5432  3206  17414992 0.01098634  329590  2022.5209
2   G2   7177 43621 313067917 0.19750063 5925019 27518.5509
3   G3   6189 48670 301218630 0.19002544 5700763 30703.7388
4   G4   8529   635   5415915 0.00341666  102500   400.5941
5   G5   2576 27919  71919344 0.04537072 1361121 17612.8500
6   G6   8462 33280 281615360 0.17765861 5329758 20994.8719
7   G7   4316 18735  80860260 0.05101114 1530334 11819.0767
8   G8   3600 41457 149245200 0.09415216 2824565 26153.3805
9   G9   4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10   9742 26207 255308594 0.16106284 4831885 16532.8309

```

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:

We can calculate the relationship between these (forcing the model through the origin):

```
> lm(nmol ~ 0 + fpkm, data = gmat)

Call:
lm(formula = nmol ~ 0 + fpkm, data = gmat)

Coefficients:
fpkm
1.585
```

So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

`p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")`

(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)

1. Yes, this makes sense. However, I have a question regarding why normalizing by the length of the gene matters if you only want to do relative expression between samples. i.e. if the gene / transcript is the same length (which it is when you are comparing the same gene across samples), it makes sense to normalize by total number of reads in each sample, but it seems to me that the length of the gene is irrelevant in that case.

2. One reason is you are frequently interested in the question, “which of two genes is more actively transcribed?” Imagine BigGene and LittleGene are both expressed at the same real levels, but BigGene is 10x larger. If you don’t normalize for gene length, you’ll see note a higher number of reads was assigned to BigGene than to LittleGene and conclude that BigGene is expressed at higher levels, even if that’s not the case.

3. Hi,
I run your code 10 times and got 10 different fpkm coefficients (ranging from 0.7 to 2.5 in my case). So although you could use FPKM within one sample to calculate the relative expression between genes (assuming no biases etc) you can’t use it to compare between samples. The same FPKM values will mean different things in different samples (as they have different coefficients). With this in mind you could equally well use FPK (without the “M”) and ignore the number of reads you have (as this is constant for all genes in a sample).

Have a look at this blog and the paper mentioned, I find it really informative. http://blog.nextgenetics.net/?e=51

Best,
Elin

• Sure, but that’s only ten genes, so it’s going to be varied.

If I re-run the code, but generate FPKMs for 10,000 genes, and calculate the coefficient 10,000 times, I get a mean value of 1275 and a standard deviation of 11. Min value is 1237 and max value is 1316. So pretty constant…. obviously with assumptions.

• Did you also try to change the distribution of concentrations? In your code you have nmol=as.integer(runif(10, min=0, max=50001)), of course, by changing the 10 genes to 10,000 you will end up with very similar datasets. Therefore you basically compare the same samples over and over again. Hence the similar coefficient.

For your statement (that FPKM makes sense) to hold you need to show that you get similar coefficient not only when you assume that gene expression is uniform etc, but that it more generally holds.

Try to run e.g. nmol=as.integer(runif(10000, min=0, max=100001)) or even nmol=as.integer(runif(10000, min=0, max=55001)) and compare it to your 1275 mean coefficient value. Then use e.g. a non-uniform distribution and see what happens. I am also not sure that you can assume that the number of genes in different samples are also the same… so maybe try to vary this variable too.

As pointed out by Heinz Ekker, Lior Pachter (who introduced FPKM in the first place!!) explains excellently in his lecture why FPKM is the wrong unit, and he suggests that one should use TPM instead as this at least theoretically makes sense.

One physics teacher gave us the advise to always do a unit analysis to check our calculations. If you do a unit analysis on FPKM you will see that the unit is 1/bp. As the relative expression (which we want to measure with FPKM) is a fraction measure it should be unit free. TPM is unit free and hence a more natural unit.

Of course if you manage to show that in reality this theoretical error does not matter, that in all samples you’ll ever analyse the coefficient will indeed be the same, then you can use FPKM. But why would you want to make it so complicated?

Using FPKM would be like measuring the strength of beer in % divided by the cloudiness of the beer, arguing that all beers you will ever investigate have very cloudiness.

Best,
Elin

• Last sentence should be: “…very _similar_ cloudiness.”

• The variation of scaling factor between FPKMs and TPMs (which do not suffer this inconsistency between samples and are also proportional to the number of molecules in the mix) was studied in 10 different samples by Wagner et al.: http://www.ncbi.nlm.nih.gov/pubmed/22872506
In Table 1, you can see that within the same species, it can vary substantially (from 2.61 to 3.06).
What is wrong with using TPMs?

4. I’m not sure why you are doing the regression part; when you simplify your steps (substitute variables), you end up with something like:
gmat\$fpkm<-1e9/sum(gmat\$length*gmat\$nmol)*gmat\$nmol
where sum(gmat\$length*gmat\$nmol)/1e9 unsurprisingly works out to be 1.585.

This constant is precisely the problem with F/RPKM values, since it varies between experiments and you cannot use them to compare. Lior Pachter has an excellent lecture on this at CSHL:

(from 30:00 on)

• I write my blog posts how the logic works in my head, rather than the most efficient (or correct!) code.

Has anyone worked on *how* much the constant varies between experiments? And ultimately, as long as it is constant within an experiment, there isn’t a problem for within-experiment comparison. And either way, FPKM is often only used for display, differential expression is carried out on counts

5. Here the main assumption is that a gene has all its transcripts of the same length!!!! That is completely wrong for eukaryotes (e.g. human, dog, rat, mouse). A gene can have several transcripts of several lengths! Also in a RNA-seq experiment one is sequencing RNA and not genes or genomes. Therefore it makes sense to compute FPKM/RPKM only and only for transcripts because for them we know very precisly their lengths. There is no correlation between gene length and transcript length and therefore “gene length” is basically articificial/sythetic noise! Therefore it is not recommended to use gene length! One may compute FPKM/RPKM for genes by computing them first for transcripts and adding (or choose our favourite operation) all the RPKM/FPKM of transcripts which belong to a gene.
Indeed one may play with the numbers but using “gene length” has no meaning whatsoever!

6. The problem I see is this assumption:
“Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix. ”

The problem is that the nature of RNA-Seq samples transcripts according to a Poisson distribution (perfect world), or more generally a negative binomial distribution, or even a mixture of NBs.

The assumption made above implies that the FPKM approach will work out (up to a constant), and therefore is circular in the argument it tries to prove.