Expression measures need to be normalized to remove biases that can be introduced during sequencing, such as sequencing depth and length of the RNA transcript. Geneious calculates three expression level measures on individual samples, which are normalized so that genes expressed in the same sample can be compared:
RPKM
Reads per kilobase per million normalizes the raw count by transcript length and sequencing depth.
RPKM = (CDS read count * 10^9) / (CDS length * total mapped read count)
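As a minimal sketch of the formula above (the function and argument names are illustrative and not part of Geneious), RPKM for a single CDS could be computed like this:

```python
def rpkm(cds_read_count, cds_length, total_mapped_reads):
    """Reads per kilobase of CDS per million mapped reads.

    cds_length is in bases; the factor 10^9 combines the per-kilobase
    (10^3) and per-million (10^6) scaling.
    """
    return (cds_read_count * 1e9) / (cds_length * total_mapped_reads)


# 500 reads on a 2,000 bp CDS in a sample with 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 25.0
```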
FPKM
Fragments per kilobase per million is the same as RPKM, except that for paired data only one mate of each pair is counted, i.e. fragments are counted rather than reads.
TPM
Transcripts per million (as proposed by Wagner et al 2012) is a modification of RPKM designed to be consistent across samples. It is normalized by total transcript count rather than total read count, and it also takes the mean read length into account.
TPM = (CDS read count * mean read length * 10^6) / (CDS length * total transcript count)
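The sketch below illustrates the TPM formula. It assumes, following Wagner et al 2012, that the total transcript count is the sum over all CDS annotations of (read count * mean read length / CDS length); this section does not define that total itself, so treat the definition and the function name as assumptions.

```python
def tpm(read_counts, cds_lengths, mean_read_length):
    """Return one TPM value per CDS.

    Assumption: the estimated transcript count of a CDS is
    read_count * mean_read_length / cds_length, and the total
    transcript count is the sum of these estimates.
    """
    transcripts = [
        count * mean_read_length / length
        for count, length in zip(read_counts, cds_lengths)
    ]
    total_transcripts = sum(transcripts)
    return [t * 1e6 / total_transcripts for t in transcripts]


# Three CDS annotations, mean read length 100 bp
values = tpm([100, 200, 300], [1000, 1000, 3000], 100)
print(values)  # [250000.0, 500000.0, 250000.0]
```

Under this definition the TPM values of a sample always sum to one million, which is what makes them comparable between samples.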
Counting
The above metrics are calculated by normalizing the count of reads that map to each CDS annotation. If a read at least partially intersects at least one interval from a CDS annotation, it is treated as mapping to that CDS annotation. Reads that map to multiple locations, or to a location that intersects multiple CDS annotations, may be counted as partial matches, excluded from the calculations, or counted as full matches to each location they map to. We recommend counting reads as partial matches: for example, a read that maps to two locations is counted as 0.5 reads at each of the two locations. When calculating statistics, reads that don’t map, or that map outside of a CDS annotation, are ignored.
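The three ways of handling ambiguous reads can be sketched as follows. This is an illustration only, not the Geneious implementation: the function name, the mode strings, and the input layout (one list of CDS hits per read) are assumptions made for the example.

```python
from collections import defaultdict


def count_reads(read_hits, mode="partial"):
    """Count reads per CDS annotation.

    read_hits: one list per read, naming every CDS annotation the read
               hits (an empty list means unmapped or outside any CDS).
    mode:      "partial" splits an ambiguous read evenly between its hits,
               "exclude" drops ambiguous reads,
               "full" counts an ambiguous read once for every hit.
    """
    if mode not in ("partial", "exclude", "full"):
        raise ValueError("unknown mode: " + mode)
    counts = defaultdict(float)
    for hits in read_hits:
        if not hits:
            continue  # ignored when calculating statistics
        if len(hits) > 1 and mode == "exclude":
            continue
        weight = 1.0 / len(hits) if mode == "partial" else 1.0
        for cds in hits:
            counts[cds] += weight
    return dict(counts)


reads = [["cds_A"], ["cds_A", "cds_B"], [], ["cds_B"]]
print(count_reads(reads, "partial"))  # {'cds_A': 1.5, 'cds_B': 1.5}
print(count_reads(reads, "exclude"))  # {'cds_A': 1.0, 'cds_B': 1.0}
print(count_reads(reads, "full"))     # {'cds_A': 2.0, 'cds_B': 2.0}
```

Partial counting keeps the total count equal to the number of reads that hit a CDS, whereas full counting inflates it and exclusion shrinks it.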
For comparing across samples, additional normalization is required, as different samples may contain different quantities of transcript. The choice of normalization method determines the Differential Expression Ratio for each gene. The following normalization methods are available in Geneious:
All of these normalization methods (and more) are described and compared by Dillies et al 2012, who recommend using Median of Gene Expression Ratios rather than the other three normalization methods implemented here. One reason for this is that a few highly expressed genes can greatly affect the total number of transcripts produced, which can distort the fraction of the total reads that contribute to genes with lower expression.
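A toy example shows why a single highly expressed gene can mislead a total-count scaling factor. The two functions below are simplified readings of "total count" and "median of gene expression ratios" for a pair of samples; they are assumptions made for illustration and may differ from what Geneious or Dillies et al 2012 compute in detail.

```python
from statistics import median


def total_count_factor(sample_a, sample_b):
    """Scaling factor of B relative to A based on total counts."""
    return sum(sample_b) / sum(sample_a)


def median_ratio_factor(sample_a, sample_b):
    """Scaling factor of B relative to A as the median per-gene ratio.

    Genes with a zero count in either sample are skipped.
    """
    ratios = [b / a for a, b in zip(sample_a, sample_b) if a > 0 and b > 0]
    return median(ratios)


# Four genes are unchanged; the fifth is 10 times higher in sample B.
sample_a = [100, 200, 300, 400, 1000]
sample_b = [100, 200, 300, 400, 10000]

print(total_count_factor(sample_a, sample_b))   # 5.5
print(median_ratio_factor(sample_a, sample_b))  # 1.0
```

With the total-count factor of 5.5, the four unchanged genes would appear 5.5-fold lower in sample B, while the median-based factor of 1.0 leaves them unchanged.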
Values to Compare
Either read counts, fragment counts, or transcript counts from each annotation can be compared. Since a single transcript can produce multiple reads and fragments, the reads and fragments produced are not independent events, so the confidence values produced by comparing them are unlikely to be accurate. For this reason we recommend comparing samples using transcript counts.
P-Value Calculation
In addition to calculating the differential expression ratio, it is useful to know whether or not that differential expression is statistically significant. This is represented by a p-value. A number of advanced methods have been published for the calculation of p-values based on a range of assumptions. Many of these are compared by Soneson & Delorenzi 2013, who conclude that no single method is optimal under all circumstances and that very small sample sizes impose problems for all evaluated methods.
In this basic differential expression plugin in Geneious we have implemented a simple statistical test based on the assumption that the gene which each observed transcript came from is an independent event.
For a given gene, the probability that a randomly selected transcript would come from that gene is calculated as (number of transcripts mapped to that gene) / (total number of transcripts from that sample). This probability is normalized, the mean of the normalized probabilities from the two samples is calculated, and this mean is then un-normalized for each sample. This produces an expected probability that a randomly selected transcript from this sample comes from that gene, assuming that this gene is not differentially expressed.
The Binomial Distribution is used to calculate, for each sample, the probability of seeing a count at least as extreme as the observed one, assuming this non-differentially expressed mean probability. The probabilities from the two samples are multiplied together to form the p-value.
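The test described above can be sketched as below. Several details are assumptions, because this section does not spell them out: normalization is modelled as division by hypothetical per-sample factors f1 and f2, transcript counts are treated as integers, and "at least as extreme" is read as the one-sided binomial tail on the side of the expected count where the observation falls. It is a sketch under those assumptions, not the plugin's code.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def tail_probability(k, n, p):
    """Probability of a count at least as extreme as k (assumed one-sided)."""
    if k >= n * p:
        tail = range(k, n + 1)   # observed at or above expectation
    else:
        tail = range(0, k + 1)   # observed below expectation
    return min(1.0, sum(binom_pmf(i, n, p) for i in tail))


def p_value(k1, n1, k2, n2, f1=1.0, f2=1.0):
    """k = transcripts from the gene, n = total transcripts in the sample.

    f1 and f2 are hypothetical normalization factors for the two samples.
    """
    mean_normalized = (k1 / n1 / f1 + k2 / n2 / f2) / 2
    expected1 = mean_normalized * f1  # un-normalized for sample 1
    expected2 = mean_normalized * f2  # un-normalized for sample 2
    return tail_probability(k1, n1, expected1) * tail_probability(k2, n2, expected2)


print(p_value(10, 100, 10, 100))  # identical samples
print(p_value(2, 100, 30, 100))   # strongly different samples, much smaller
```

Summing the tail term by term is adequate for small counts like these; for realistic totals a cumulative distribution function from a statistics library would be the more practical choice.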