Simulating the perfect scenario

Let’s get our hands dirty in R.  I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:


gmat <- data.frame(id=paste("G",1:10,sep=""),
 length=as.integer(runif(10, min=200, max=10001)), 
 nmol=as.integer(runif(10, min=0, max=50001)),
 stringsAsFactors=FALSE)

This is what my data look like

> gmat
    id length  nmol
1   G1   5432  3206
2   G2   7177 43621
3   G3   6189 48670
4   G4   8529   635
5   G5   2576 27919
6   G6   8462 33280
7   G7   4316 18735
8   G8   3600 41457
9   G9   4973 21935
10 G10   9742 26207

What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene.  A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.

gmat$sspace <- gmat$length * gmat$nmol
ssum <- sum(gmat$sspace)
gmat$p <- gmat$sspace/ssum

My data now look like this:

> gmat
    id length  nmol    sspace          p
1   G1   5432  3206  17414992 0.01098634
2   G2   7177 43621 313067917 0.19750063
3   G3   6189 48670 301218630 0.19002544
4   G4   8529   635   5415915 0.00341666
5   G5   2576 27919  71919344 0.04537072
6   G6   8462 33280 281615360 0.17765861
7   G7   4316 18735  80860260 0.05101114
8   G8   3600 41457 149245200 0.09415216
9   G9   4973 21935 109082755 0.06881546
10 G10   9742 26207 255308594 0.16106284

My p-values should sum to one, and they do:

> sum(gmat$p)
[1] 1

Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix.  Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:

gmat$nreads <- round(30000000 * gmat$p)

And I can now calculate my FPKM values:

gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30

My data now look like this:

> gmat
    id length  nmol    sspace          p  nreads       fpkm
1   G1   5432  3206  17414992 0.01098634  329590  2022.5209
2   G2   7177 43621 313067917 0.19750063 5925019 27518.5509
3   G3   6189 48670 301218630 0.19002544 5700763 30703.7388
4   G4   8529   635   5415915 0.00341666  102500   400.5941
5   G5   2576 27919  71919344 0.04537072 1361121 17612.8500
6   G6   8462 33280 281615360 0.17765861 5329758 20994.8719
7   G7   4316 18735  80860260 0.05101114 1530334 11819.0767
8   G8   3600 41457 149245200 0.09415216 2824565 26153.3805
9   G9   4973 21935 109082755 0.06881546 2064464 13837.8180
10 G10   9742 26207 255308594 0.16106284 4831885 16532.8309

Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:

fpkmnmol

We can calculate the relationship between these (forcing the model through the origin):


> lm(nmol ~ 0 + fpkm, data = gmat)

Call:
lm(formula = nmol ~ 0 + fpkm, data = gmat)

Coefficients:
 fpkm  
1.585  

So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:

p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")

fpkmnmolline

(please do not use this formula more generically – it really only works in this simple simulation!)

So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.

(unless I have made any huge errors or incorrect assumptions, which is possible)