FPKM and TPM Units for RNA Seq after Pachter


See Pachter, L. Models for transcript quantification from RNA-seq.

Count-based models

RPKM/FPKM

Simplest version: the transcriptome is a set of transcripts with different abundances, and a read is produced by choosing its start site in a transcript uniformly at random from among all available positions.

  • $T$ is the set of transcripts.
  • $\ell(t)$ is the length of transcript $t$ for $t \in T$.
  • $\rho_t$ is the relative abundance of transcript $t$ in the transcriptome. (That is, it is the number of mRNA molecules of type $t$ divided by the total number of mRNA molecules.)
  • $m$ is the length of the reads.
  • $F$ is the set of all possible reads and $F_t$ is the set of reads that map to transcript $t$.
  • The effective length $\tilde{\ell}(t)$ is $\ell(t) - m + 1$, the total number of places where a read of length $m$ could begin.

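What’s the chance of choosing a read from a transcript $t$? Let $m_t$ be the number of transcript molecules of type $t$ and $M = \sum_{t \in T} m_t$ be the total number of transcript molecules. The number of possible reads coming from the $m_t$ molecules of type $t$ is $m_t \tilde{\ell}(t)$. The total number of possible reads from all transcripts is $\sum_{i \in T} m_i \tilde{\ell}(i)$. Therefore the probability of getting a read from $t$ is $\alpha_t = \frac{m_t \tilde{\ell}(t)}{\sum_{i \in T} m_i \tilde{\ell}(i)}$ and, since $\rho_t = m_t / M$, this is the same as $\alpha_t = \frac{\rho_t \tilde{\ell}(t)}{\sum_{i \in T} \rho_i \tilde{\ell}(i)}$.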

It’s worth observing that $\sum_{t \in T} \alpha_t = 1$ and that the $\alpha_t$ are non-negative.

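The $\alpha$ and $\rho$ distributions are related (via the effective lengths $\tilde{\ell}$) in the opposite direction by the equation $\rho_t = \frac{\alpha_t / \tilde{\ell}(t)}{\sum_{i \in T} \alpha_i / \tilde{\ell}(i)}$.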
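The two conversions between $\rho$ and $\alpha$ can be sketched in a few lines of Python. This is only an illustration: the function names and the three-transcript toy data are made up for the example and do not come from Pachter's paper.

```python
def rho_to_alpha(rho, eff_len):
    """Read probabilities alpha_t from transcript abundances rho_t."""
    weights = [r * l for r, l in zip(rho, eff_len)]
    total = sum(weights)
    return [w / total for w in weights]


def alpha_to_rho(alpha, eff_len):
    """Transcript abundances rho_t from read probabilities alpha_t."""
    weights = [a / l for a, l in zip(alpha, eff_len)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy transcriptome (assumed values, for illustration only).
m = 100                                  # read length
lengths = [1100, 2100, 600]              # transcript lengths l(t)
eff_len = [l - m + 1 for l in lengths]   # effective lengths l(t) - m + 1
rho = [0.5, 0.3, 0.2]                    # relative abundances, sum to 1

alpha = rho_to_alpha(rho, eff_len)       # long transcripts get more reads
rho_back = alpha_to_rho(alpha, eff_len)  # recovers rho up to rounding
```

In this toy example the second transcript has the largest effective length, so its share of reads is larger than its share of molecules, and converting back divides that length bias out again.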

The maximum likelihood estimate of $\alpha_t$ is $\hat{\alpha}_t = X_t / N$, where $N$ is the total number of mapped reads and $X_t$ is the number of reads mapped to $t$.

The maximum likelihood estimate for $\rho_t$, which is the relative abundance of transcript $t$ among the expressed transcripts, is (using the above equations) proportional to $X_t / \tilde{\ell}(t)$: $\hat{\rho}_t \propto \frac{X_t}{N \tilde{\ell}(t)}$. (You need to divide by the sum of the terms on the right over all transcripts to get equality.)

The number $\frac{X_t}{\tilde{\ell}(t) N} \cdot 10^9$ is the “Reads Per Kilobase per Million mapped reads” (RPKM), because $X_t / (N / 10^6)$ is the “reads per million mapped reads” and dividing that by $\tilde{\ell}(t) / 10^3$, the effective length in kilobases, gives “reads per million mapped reads per kilobase.”
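As a rough sketch of that two-step calculation, the following Python computes RPKM from read counts and effective lengths. The function name and the toy numbers are hypothetical, and the sketch assumes that $N$ is simply the sum of the counts passed in (that is, every mapped read is assigned to one of the listed transcripts).

```python
def rpkm(counts, eff_len):
    """RPKM per transcript: X_t * 1e9 / (effective length * N)."""
    n_mapped = sum(counts)  # assumption: N is the sum of the given counts
    values = []
    for x, l in zip(counts, eff_len):
        reads_per_million = x / (n_mapped / 1e6)    # X_t / (N / 10^6)
        values.append(reads_per_million / (l / 1e3))  # divide by length in kb
    return values


# Hypothetical toy data: counts are proportional to effective length.
counts = [500, 1000, 250]
eff_len = [1000, 2000, 500]
values = rpkm(counts, eff_len)
```

Here every transcript has the same number of reads per base, so all three RPKM values come out equal even though the raw counts differ by a factor of four.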

TPM

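The ‘transcripts per million’ measure (maximum likelihood) is $10^6 \hat{\rho}_t$, so to compute TPM you take the abundance measured in RPKM, divide by the sum of RPKM over all the transcripts, and multiply by $10^6$. The factors of $N$ and $10^9$ cancel in that ratio, which leaves $10^6 \cdot \frac{X_t / \tilde{\ell}(t)}{\sum_{i \in T} X_i / \tilde{\ell}(i)}$.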
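A small Python sketch can check that the two routes to TPM agree: rescaling RPKM values, or normalizing the per-base read rates $X_t / \tilde{\ell}(t)$ directly. The function names and toy data are hypothetical, and $N$ is again assumed to be the sum of the listed counts.

```python
def tpm_from_rpkm(rpkm_values):
    """TPM by rescaling RPKM values so that they sum to one million."""
    total = sum(rpkm_values)
    return [1e6 * r / total for r in rpkm_values]


def tpm_from_counts(counts, eff_len):
    """TPM as 1e6 * rho_hat, with rho_hat proportional to X_t / eff. length."""
    rates = [x / l for x, l in zip(counts, eff_len)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical toy data (assumed values, for illustration only).
counts = [300, 600, 100]
eff_len = [1000, 3000, 500]
n_mapped = sum(counts)  # assumption: N is the sum of the given counts

rpkm_values = [x * 1e9 / (l * n_mapped) for x, l in zip(counts, eff_len)]
tpm_a = tpm_from_rpkm(rpkm_values)
tpm_b = tpm_from_counts(counts, eff_len)
```

Both lists sum to one million, and they match up to floating-point rounding, because the library-size factor $N$ and the $10^9$ scaling cancel when RPKM is divided by its own sum.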
