Simulating the perfect scenario
Let’s get our hands dirty in R. I’m going to simulate 10 genes, each between 200 and 10000 nucleotides long and with between 0 and 50000 molecules in the mix:
gmat <- data.frame(id=paste("G",1:10,sep=""), length=as.integer(runif(10, min=200, max=10001)), nmol=as.integer(runif(10, min=0, max=50001)), stringsAsFactors=FALSE)
This is what my data look like
> gmat id length nmol 1 G1 5432 3206 2 G2 7177 43621 3 G3 6189 48670 4 G4 8529 635 5 G5 2576 27919 6 G6 8462 33280 7 G7 4316 18735 8 G8 3600 41457 9 G9 4973 21935 10 G10 9742 26207
What I want to do now is calculate the total amount of sequence space each gene represents, and use that to calculate a probability of RNA-Seq reads coming from that gene. A simple example: if there are 10 copies of a gene in the mix, each of 1000 nucleotides long, then the sequence space of that gene is 10,000 nucleotides.
gmat$sspace <- gmat$length * gmat$nmol ssum <- sum(gmat$sspace) gmat$p <- gmat$sspace/ssum
My data now look like this:
> gmat id length nmol sspace p 1 G1 5432 3206 17414992 0.01098634 2 G2 7177 43621 313067917 0.19750063 3 G3 6189 48670 301218630 0.19002544 4 G4 8529 635 5415915 0.00341666 5 G5 2576 27919 71919344 0.04537072 6 G6 8462 33280 281615360 0.17765861 7 G7 4316 18735 80860260 0.05101114 8 G8 3600 41457 149245200 0.09415216 9 G9 4973 21935 109082755 0.06881546 10 G10 9742 26207 255308594 0.16106284
My p-values should sum to one, and they do:
> sum(gmat$p) [1] 1
Now, let’s assume that we have 30 million RNA-Seq reads, and that those reads have an equal chance of coming from any of the copies of any of the genes in the mix. Therefore, 30million multiplied by my p-values will give me my RNA-Seq counts:
gmat$nreads <- round(30000000 * gmat$p)
And I can now calculate my FPKM values:
gmat$fpkm <- (gmat$nreads / (gmat$length / 1000)) / 30
My data now look like this:
> gmat id length nmol sspace p nreads fpkm 1 G1 5432 3206 17414992 0.01098634 329590 2022.5209 2 G2 7177 43621 313067917 0.19750063 5925019 27518.5509 3 G3 6189 48670 301218630 0.19002544 5700763 30703.7388 4 G4 8529 635 5415915 0.00341666 102500 400.5941 5 G5 2576 27919 71919344 0.04537072 1361121 17612.8500 6 G6 8462 33280 281615360 0.17765861 5329758 20994.8719 7 G7 4316 18735 80860260 0.05101114 1530334 11819.0767 8 G8 3600 41457 149245200 0.09415216 2824565 26153.3805 9 G9 4973 21935 109082755 0.06881546 2064464 13837.8180 10 G10 9742 26207 255308594 0.16106284 4831885 16532.8309
Satisfyingly, I have a nice linear relationship between the number of molecules in the mix, and my calculated FPKM:
We can calculate the relationship between these (forcing the model through the origin):
> lm(nmol ~ 0 + fpkm, data = gmat) Call: lm(formula = nmol ~ 0 + fpkm, data = gmat) Coefficients: fpkm 1.585
So, in our model, nmol = 1.585 * fpkm, and we can add this line to the plot:
p + geom_point() + geom_abline(intercept=0, slope=1.585, col="red")
(please do not use this formula more generically – it really only works in this simple simulation!)
So for perfect simulated data, the number of molecules (transcripts) in the original mix correlates perfectly with FPKM.
(unless I have made any huge errors or incorrect assumptions, which is possible)