Introducing pufferfish: an efficient index for the colored, compacted de Bruijn graph

Note: This post describes some cool data structure work (link to the work-in-progress GitHub repo here) my students @fataltes, @hrksrkr, and I have been doing (pre-print coming soon!).  If you find these ideas interesting, or you have any interesting thoughts or suggestions, I’m sure they’d appreciate a shout out!

Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an assembly and sequence comparison data structure, it has only relatively recently begun to see use as an index of the reference sequences (e.g. deBGA, kallisto). Particularly, these tools index the compacted dBG (cdBG), in which all non-branching paths are collapsed into individual nodes and labeled with the string they spell out. This data structure is particularly well-suited for representing repetitive reference sequences, since a single contig in the cdBG represents all occurrences of the repeated sequence. The original positions in the reference can be recovered with the help of an auxiliary “contig table” that maps each contig to the reference sequence, position, and orientation where it appears as a substring. The deBGA paper has a nice description of how this kind of index looks (they call it a unipath index, because the contigs we index are unitigs in the cdBG), and how all the pieces fit together to resolve the queries we care about. Moreover, the cdBG can be built on multiple reference sequences (transcripts, chromosomes, genomes), where each reference is given a distinct color (or colour, if you’re of the British persuasion). The resulting structure, which also encodes the relationships between the cdBGs of the underlying reference sequences, is called the “compacted, colored de Bruijn graph” (ccdBG).

This is not, of course, the only variant of the dBG that has proven useful from an indexing perspective. The (pruned) dBG has also proven useful as a graph upon which to build a path index of arbitrary variation / sequence graphs, which has enabled very interesting and clever indexing schemes like that adopted in GCSA2 (which we won’t discuss further here, but which I hope to cover in more detail in a future post). Also, thinking about sequence search in terms of the dBG has led to interesting representations for variation-aware sequence search backed by indexes like the vBWT (implemented in the excellent gramtools package).

Here, however, we focus on indices built directly on the ccdBG, and specifically on schemes that index this structure by hashing the k-mers it contains. While there are many benefits to other indexing data structures (e.g., full-text structures like those built on the suffix tree, (enhanced) suffix array, FM-index, etc.), hashing-based schemes have certain very nice properties. Often, they are simple to understand and reason about, and they can generally be made to have small constant factors, making them very fast for lookup (which is one of the primary reasons we’re interested in pursuing such a scheme).

While existing hash-based indices based on the cdBG (and ccdBG) are very efficient for search, they typically occupy a large amount of space in memory (both during construction and once built). As a result, to make use of such data structures on large reference sequences (e.g., the human genome) or collections of reference sequences (e.g., in a metagenomic or microbiomic context), one typically requires a very large memory machine — if the structures can be built at all.

Recently, we’ve been working on a pretty nifty new data structure (at least my students and I think it’s nifty, anyway). This new and more compact data structure is implemented in our new software, pufferfish, for indexing the ccdBG. While maintaining very efficient queries, pufferfish indexes reference sequences with considerably lower memory requirements (by about an order of magnitude compared with existing hash-based solutions when we adopt the sparse variant of our index). This greatly reduces the memory burden of indexing reference sequences and makes it possible to build hash-based indexes at scales that were not previously feasible.

Pufferfish supports (i.e., implements) two types of indices — called dense and sparse. Both make use of an efficient minimal perfect hash implementation to vastly reduce the memory requirements compared to other hashing-based dBG, colored-dBG, and compacted-colored-dBG indices. However, the sparse index saves even more space by directly storing the positions for only a subset of k-mers (and inferring the positions dynamically for the rest).

The basic (dense) pufferfish index

The basic (dense) pufferfish index can be visualized as in the image below:

[Figure: schematic of the dense pufferfish index, showing CSeq, the boundary vector BV, the MPHF h(), the pos vector, and the contig table CTab]

The contigs of the compacted dBG are stored contiguously in a (2-bit encoded) vector CSeq, and a boundary bit vector BV marks the end of each contig. All of the N valid k-mers (i.e., k-mers not crossing contig boundaries) are indexed in a minimal perfect hash function (MPHF). A k-mer query (i.e., does this k-mer exist in the index, and if so, where does it occur in the reference?) is answered by consulting the MPHF h(). Compared to traditional hashing approaches, this can save a tremendous amount of space, since the hash function spends about 3 bits per key (k-mer), regardless of k (whereas traditional hashes might spend 8-12 bytes, or more, as they have to store the key explicitly and maintain a reasonable load factor). For a k-mer x, if 0 <= h(x) < N, then either x != CSeq[pos[h(x)] : pos[h(x)] + k], in which case x does not appear in the ccdBG, or pos[h(x)] stores the position of x in CSeq. Efficient rank and select queries on the BV vector (we’re currently using the excellent SDSL-lite library for such things) can be used to determine the relative offset of x in the contig (say c) where it occurs, and to look up all occurrences of c in the reference sequence. This information is encoded in the contig reference table (CTab), which records the reference sequence, offset and relative orientation of all occurrences of c in the reference.

Pufferfish also records and stores in the index (though this is not actually required to answer the basic k-mer queries) the equivalence class of each contig in terms of the reference sequences where it appears (i.e., two contigs belong to the same equivalence class if they belong to exactly the same set of transcripts / genomes / etc.). This information can be useful for fast mapping algorithms.
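To make the lookup procedure concrete, here is a minimal C++ sketch of the dense query path. Everything here is hypothetical scaffolding with names of my own invention (DenseIndexSketch, lookup, etc.): the MPHF is faked with an ordinary map, rank is a naive loop, and canonicalization plus the CTab lookup (via select) are omitted — the real implementation uses a true minimal perfect hash and SDSL-lite structures.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the dense query path. The "MPHF" is an ordinary map and rank is
// a naive loop; a real implementation would use an actual minimal perfect
// hash and SDSL-lite rank/select supports over a 2-bit-packed CSeq.
struct DenseIndexSketch {
    std::string cseq;                  // concatenated contigs
    std::vector<uint8_t> bv;           // bv[i] == 1 iff position i ends a contig
    std::vector<uint64_t> pos;         // pos[h(x)] = position of k-mer x in cseq
    std::function<uint64_t(const std::string&)> mphf; // stand-in MPHF
    uint64_t num_kmers = 0;            // N, the number of valid k-mers
    int k = 0;

    // rank(i): set bits in bv[0..i-1], i.e., how many contigs end before i.
    uint64_t rank(uint64_t i) const {
        uint64_t r = 0;
        for (uint64_t j = 0; j < i; ++j) r += bv[j];
        return r;
    }

    // Does k-mer x occur? If so, report its global position and contig id.
    bool lookup(const std::string& x, uint64_t& global_pos, uint64_t& contig_id) const {
        uint64_t h = mphf(x);
        if (h >= num_kmers) return false;             // not a slot in the MPHF's range
        uint64_t p = pos[h];
        // An MPHF can map absent keys anywhere, so validate against CSeq.
        if (cseq.compare(p, k, x) != 0) return false; // x != CSeq[p : p + k]
        global_pos = p;
        contig_id = rank(p);  // contigs ending before p = id of the containing contig
        return true;
    }
};

int main() {
    // Toy index over two contigs, "ACGTA" and "TTACC", with k = 3.
    DenseIndexSketch idx;
    idx.k = 3;
    idx.cseq = "ACGTATTACC";
    idx.bv = {0, 0, 0, 0, 1, 0, 0, 0, 0, 1};
    // The six valid k-mers (none spans a contig boundary) and their positions.
    std::unordered_map<std::string, uint64_t> h{
        {"ACG", 0}, {"CGT", 1}, {"GTA", 2}, {"TTA", 3}, {"TAC", 4}, {"ACC", 5}};
    idx.pos = {0, 1, 2, 5, 6, 7};
    idx.num_kmers = h.size();
    idx.mphf = [h](const std::string& s) -> uint64_t {
        auto it = h.find(s);
        return it == h.end() ? UINT64_MAX : it->second;
    };

    uint64_t p = 0, c = 0;
    bool found = idx.lookup("TAC", p, c);
    std::cout << found << " " << p << " " << c << "\n"; // 1 6 1
    std::cout << idx.lookup("GGG", p, c) << "\n";       // 0 (absent)
}
```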

The sparse pufferfish index

In addition to the tables maintained in the dense index, the sparse index records four other pieces of information: three extra bit-vectors (IsSamp, IsCanon, and ExtRight) and one extra packed integer vector (QueryExt). The sparse index builds on the observation that every valid k-mer in the cdBG appears exactly once, and is one of:

  1. a complex k-mer (it has an in- or out-degree > 1)
  2. a terminal k-mer (it starts or ends some input reference sequence)
  3. a k-mer with precisely one successor / predecessor

Instead of storing the position for every k-mer (which requires \lg(\lvert \texttt{CSeq} \rvert) bits per entry), the sparse index stores positions for only a sampled set of k-mers (those where IsSamp is 1) — so that pos becomes considerably smaller. For k-mers whose positions are not stored, the QueryExt table records which nucleotides should be added to either the beginning (if ExtRight is 0) or end (if ExtRight is 1) of the query k-mer to move toward the nearest sampled position. We also need to know whether the variant of the k-mer that appears in the CSeq table is canonical or not (hence the IsCanon vector). In this scheme, successive queries and extensions move toward the nearest sampled k-mer until a position can be recovered, at which point the implied position of the query k-mer can be computed and the rest of the query proceeds as in the dense index. By trading off the rate of sampling and the size of the extension that is stored (e.g., 1, 2, 3, … nucleotides), one can trade off index size against query speed. A sparser sampling and shorter extensions require more queries to locate a k-mer, but result in a smaller index, while a denser sampling and/or longer extensions require fewer queries in the worst case, but more space. If the extension size e is \ge \frac{s-1}{2}, where s is the sampling rate, one can guarantee that at most one extra query need be performed when searching for a k-mer. The sparse index is currently in the develop branch of pufferfish, and should make it into master soon.
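To see why this guarantee holds (the arithmetic here is mine, spelled out for clarity): if positions are sampled every s entries along a contig, then a non-sampled k-mer lies at distance at most \left\lceil \frac{s-1}{2} \right\rceil from its nearest sampled k-mer (in whichever direction the stored extension points). A single stored extension moves the query by up to e positions, so e \ge \frac{s-1}{2} means the first extension always lands on a sampled k-mer — i.e., at most one extra MPHF lookup. For the parameters used in the example below (s = 9, e = 4), the worst-case distance is exactly 4 = e.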

Here is a visual example of querying for the position of k-mer x = CAGCGGC in CSeq where s = 9 and e = 4. We find that the entry corresponding to x is not sampled, and the corresponding bit of IsCanon tells us that the variant of this k-mer stored in CSeq is not canonical. Since x is canonical, we take the reverse complement to get GCGGCTG. We then consult the ExtRight vector to determine that the extension TGTC should be applied to the right of the k-mer GCGGCTG. After applying this extension we get our new key x’ = CTGTCCT, which we then look up again in the MPHF. This time, we see that IsSamp[h(x')] = 1, so we can retrieve the position of x' as p_{x’} = pos[rank(IsSamp, h(x'))] (since we only store the positions for the sampled k-mers). Finally, since we know that we extended the canonical k-mer to the right by 4 bases to obtain x', we look for our original query at p_{x’} - 4, leading us to find (in this case) the reverse complement of our original query key, GCGGCTG.

One key here is that if we keep only sampled positions in the pos vector, and extension entries for only the non-sampled positions in the QueryExt vector, then, for a key x that hashes to an index h(x) = i, we can retrieve the position as pos[rank(IsSamp, i)] if IsSamp[i] is 1, and we can retrieve the extension nucleotides at QueryExt[i - rank(IsSamp, i)] otherwise. This makes access to all of the auxiliary information needed to perform the sample “jumping” efficient — all we need are lookups on the bit-vectors, and rank on the IsSamp vector in particular. So, why does this index save us space?
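In code, that indirection looks roughly like the following (again a hypothetical sketch with made-up names mirroring the vectors above; rank is a naive stand-in for an SDSL rank support):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of the sparse index's storage indirection. For a k-mer hashing to
// slot i, rank on IsSamp tells us which compacted array holds its entry:
// pos (if sampled) or QueryExt (if not), and at what offset.
struct SparseSlots {
    std::vector<uint8_t> is_samp;    // 1 bit per valid k-mer
    std::vector<uint64_t> pos;       // positions, sampled k-mers only
    std::vector<uint8_t> query_ext;  // packed extensions, non-sampled k-mers only

    // rank(i): set bits in is_samp[0..i-1] (naive; O(1) with SDSL supports).
    uint64_t rank(uint64_t i) const {
        uint64_t r = 0;
        for (uint64_t j = 0; j < i; ++j) r += is_samp[j];
        return r;
    }

    bool sampled(uint64_t i) const { return is_samp[i] != 0; }
    // Sampled slot: its entry lives at index rank(i) in pos.
    uint64_t position(uint64_t i) const { return pos[rank(i)]; }
    // Non-sampled slot: its entry lives at index i - rank(i) in query_ext.
    uint8_t extension(uint64_t i) const { return query_ext[i - rank(i)]; }
};

int main() {
    SparseSlots s;
    s.is_samp = {1, 0, 0, 1, 0, 0};  // slots 0 and 3 are sampled
    s.pos = {0, 9};                  // positions for slots 0 and 3
    s.query_ext = {4, 7, 2, 5};      // extensions for slots 1, 2, 4, 5
    std::cout << s.position(3) << "\n";        // 9
    std::cout << int(s.extension(4)) << "\n";  // 2
}
```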

In the sampled index we require 1 extra bit per valid k-mer (IsSamp). For each of the non-sampled k-mers we require 2e bits (for QueryExt) + 3 bits (one for each of the new bit vectors) + \lg(e) bits (to remember the extension size). When one chooses e and s intelligently, this can be considerably less than the storage requirements of the dense index. Now, note that there is an even more compact scheme for storing the actual extension characters (the other bit vectors are still required) — one that uses 2e' + \lg(e) bits to store each extension of size e' <= e. Here, e' is the actual length of the extension, not the maximum extension length, and in expectation this is equal to \frac{e}{2}. This requires a fast data structure for static prefix-sum, which we haven’t yet implemented (if you know of one with a nice C++ implementation, please let us know!).
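As a back-of-the-envelope example using the accounting above (my numbers, not measured ones): for a human-genome-sized CSeq, a position costs \lg \lvert \texttt{CSeq} \rvert \approx 32 bits, which is what the dense index pays per valid k-mer. In the sparse index with s = 9 and e = 4, a sampled k-mer (1 in 9) still pays its ~32-bit position, while a non-sampled one pays 2e + 3 + \lg(e) = 8 + 3 + 2 = 13 bits, giving an average of roughly \frac{1}{9} \cdot 32 + \frac{8}{9} \cdot 13 \approx 15 bits of positional and auxiliary information per k-mer — about half the dense cost, even before the more compact variable-length extension scheme.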

Under this sampling scheme, we only store positions for (roughly) 1 out of every s k-mers. When CSeq is large, and these positions require 4-8 bytes each, these savings are considerable. If one is willing to perform more iterations of the query procedure described above for non-sampled positions, she can make the extension size shorter and the sampling factor larger. That is, this process can be applied iteratively, and one can find the next sampled position in at most \frac{s-1}{2e} extension jumps in the worst case. This slows down queries by a constant factor, but reduces the size of the index by a similar factor. We note that, in the extreme case, one could dispense with the extension vector entirely, and simply try each of the 4 possibilities (i.e., \{A,C,G,T\}) for the predecessor / successor of a query; however, we require an extension size of at least 1 for reasons of performance in the pufferfish implementation.

A sneak peek at pufferfish

Pufferfish is still very much a work in progress. We’re still in the process of tweaking aspects of the representation to optimize time / memory, and we are benchmarking different data structures for the pufferfish preprint, the first version of which should land soon. However, the initial results are very promising. For example, we can take a look at how pufferfish compares in terms of index size and lookup time to the other hashing-based ccdBG indices (those implemented in deBGA and kallisto) for data of two different sizes: the human transcriptome and the human genome. The former is a relatively small reference, but one with considerable shared sequence induced (mainly) by alternative splicing. The latter is a reasonably large reference with a tremendous amount of repetition, including some very complex and some simple regions of the compacted dBG. First, we look at the sizes of the dense and sparse pufferfish indices, as well as those of kallisto and deBGA, for the human transcriptome:

Table 1 (transcriptome indexing)

Method | Index size (on disk) | Index construction memory (max_rss) | Lookup memory (max_rss)
--- | --- | --- | ---
pufferfish (dense) | 395M | 454M | 446M
pufferfish (sparse) | 277M | 345M | 331M
deBGA | 2.8G | 8.1G | 6.8G
kallisto | 1.7G | 3.4G | 3.3G

So, we see that even on a relatively small reference (the transcriptome reference — gencode v25 protein coding transcripts — is 201MB), and even when using the dense pufferfish index, we obtain considerable savings over the other methods, with the dense pufferfish index being about 6.7x smaller in memory than the closest alternative representation.  Of course, while these numbers represent considerable overhead for the index (relative to the size of the text itself), the transcriptome is small enough that this doesn’t pose too much of a problem.  However, when the size of the reference we wish to index becomes larger, so do our resource requirements … and by a considerable margin.

Table 2 (genome indexing)

Method | Index size (on disk) | Index construction memory (max_rss) | Lookup memory (max_rss)
--- | --- | --- | ---
pufferfish (dense) | 17G | 22G | 18.8G
pufferfish (sparse) | 12G | 22G | 13.6G
deBGA | 40G* | 69.6G* | 40.3G*
kallisto | 57G | 157G | 115G

Again, we see a similar trend, except that this time the expansion in the index size considerably increases the memory burden.  These indices can get big fast.  This means that, using existing hashing-based cdBG indices, you may need a reasonably large-memory server to index references like the human genome, and you might not even be able to index much larger references (e.g., much larger genomes or large collections of microbial genomes).

Also, you might notice the ‘*’ next to the deBGA numbers here. This is because these numbers are approximate and taken from the deBGA GitHub repository. It’s worth noting that those numbers are for k=22, and so one might expect the proper index to be even larger under the parameters we used here for the other tools (k=27). The reason for using these numbers is that we ran into an issue when using deBGA to index the human genome. It produced an index that was, in fact, around the same size as the (dense) pufferfish index. This was smaller than we’d expected, and further inspection led us to discover that, despite the fact that index construction completed (i.e., the program ran and reported successful termination), it seemed to be storing information for only 1,332,190,304 distinct k-mers. However, pufferfish and kallisto both agree that there are 2,579,117,522 distinct k-mers in this particular version of the genome (GRCh38 primary assembly). So, for some reason, deBGA seems to be indexing only about half of the k-mers here (there could, of course, be user error).

If we “extrapolate” the numbers, however, we can see that doubling the index size (to account for a 2-fold increase in the number of indexed k-mers) would make the index size and memory usage approximately agree with what is reported in the deBGA GitHub repo. We intend to keep digging to figure out what’s going on here, and have filed the relevant reports at the deBGA GitHub repo, but the numbers in Table 2 above seem to agree fairly well with what is reported in the deBGA repository and with what we might expect extrapolating the results we obtained. Regardless, we can’t do lookup (i.e., speed) tests with deBGA until we find out why it seems not to be locating all of the indexed k-mers.

Lookup Speeds

The other result of interest here is lookup speed — that is, given a query k-mer, how fast the index can recall the reference loci where the k-mer appears (or report if the k-mer doesn’t, in fact, appear in any reference).  You may not always want to recall all the reference positions, and may, instead, wish to recall some associated information for the k-mer (like the equivalence class etc.).  In theory, all three indices discussed above support these queries, though deBGA does not currently compute or store equivalence class information.

Currently, pufferfish is only an indexing data structure and not a full mapper or aligner.  So, to try and make a fair comparison, we benchmarked some query operations for k-mer lookup. Specifically, we retrieved the total number of hits — i.e. total number of reference loci matches — for all of the k-mers in a set of reads.  Currently, we have not figured out how to include deBGA in this benchmark, due mainly to the issue described above (i.e., it seems not to be indexing all of the k-mers and hence doesn’t give a result that agrees with the other tools, precluding a meaningful speed comparison).

Nonetheless, we can compare k-mer lookup speeds to those of kallisto (which are blazingly fast, since it is essentially lookup in an uncompressed hash table mapping k-mers to the relevant contig information). We find that the dense pufferfish index has k-mer query performance that is only a little bit (25-30%) slower than that of kallisto. The extra timing overhead mostly comes from the fact that the minimal perfect hash lookup tends to be a bit slower than lookup in a normal hash, and that obtaining the contig-relative position of a k-mer from the global position requires a select operation on BV (it also requires rank, but that tends to be very fast). Lookup in the sparse pufferfish index (with an extension size of 4 and a sampling factor of 9) tends to be about 3-3.5X slower than in the dense index. This is still quite fast, as the average query time per k-mer (for an index over the genome) in the dense index is ~133 ns while for the sparse index it is ~468 ns [1]. This difference roughly tracks the difference in the size of the positional information being stored (e.g., for the human genome, the dense index uses ~2X more space to store k-mer position information compared to the sparse index — 9.7G vs 4.8G — though the total difference in index size is a smaller factor because the other structures remain the same). However, it’s also the case that there is considerable room for optimization in the sparse query code, so it’s very likely the gap can be closed further.

Given the reduction in space achieved by both the dense and sparse pufferfish indices, we find these initial results very promising.  The dense index can provide a much more compact index than a traditional hash-based approach, with very little sacrifice in speed.  On the other hand, the sparse index can provide an order of magnitude reduction in (in-memory) index size, while still providing very efficient queries, albeit slower than the dense index or traditional hash tables.  Again, it’s worth noting that the sampling scheme adopted by the sparse pufferfish index provides a “knob”.  These numbers represent the “fast” end of the knob (one need perform at most one extra extension and lookup operation for a given k-mer query), but more extreme sampling values have the potential to reduce the space required for storing positional information much further.

One final interesting point worth noting is that, due to its smaller size and its layout on disk and in memory, the pufferfish index can be faster to load than traditional hash-table-based indices when the reference is large. This is partly because “de-serializing” a hash table basically amounts to re-inserting all of the elements, and partly because some of our auxiliary information is stored in a format more appropriate for reading large chunks of memory at once. This can lead to some interesting scenarios where pufferfish (dense) is actually faster to load the reference and query many k-mers than other indices that otherwise have a slightly faster per-query time. For example, loading the genome index and querying all of the k-mers from a set of 33.4M 100bp reads (using a single thread) took pufferfish 6m 56s and took kallisto 11m 55s. Of course, for any type of index that will be used to make queries over many different sets of reads, it would make sense to engineer a shared-memory index, so that one need pay the index loading cost only once and can keep the index in memory, where it can be mapped by multiple processes until all of the querying is complete. It’s also true, of course, that the engineering effort that needs to go into a cross-platform and robust shared-memory implementation is non-trivial ;P.

Some practical considerations and upcoming improvements

So, the pufferfish index already looks like it’s doing a pretty good job of providing an efficient ccdBG index in terms of both memory usage and query performance. However, there are still quite a few places for improvement. For example, right now, the contig table stores the list of reference sequences, offsets and orientations for each contig using a vector of vectors, where the reference sequence id and offset are stored as fixed-width integers. This potentially wastes considerable space when the number of reference sequences is far less than 4,000,000,000 and when the longest reference sequence (i.e., transcript or chromosome) is far shorter than 4,000,000,000 nucleotides. By replacing this array-of-structures (AoS) layout with a structure of arrays (SoA), we can pack the reference ids and reference offsets into the appropriate number of bits, thereby reducing the space of the contig table. Further, there is still considerable room for improving the performance of querying in the sparse index. We’ll be working on this in the immediate future, and then likely turning the pufferfish index into a fully-fledged mapper / aligner. We are also very interested in exploring how this data structure, and the algorithms we can build on top of it, will be useful when indexing a large collection of different sequences — there seem to be many potentially interesting applications in metagenomics and microbiomics.
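As a toy illustration of the kind of packing an SoA layout enables (this is a dependency-free stand-in for what sdsl::int_vector already provides; the names and field widths here are made up):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy bit-packed integer vector: stores n values of 'width' bits each,
// rather than a fixed 32 or 64 bits per value. Assumes write-once set()
// and values that fit in 'width' bits.
struct PackedVec {
    std::vector<uint64_t> words;
    unsigned width;
    PackedVec(std::size_t n, unsigned w) : words((n * w + 63) / 64, 0), width(w) {}
    void set(std::size_t i, uint64_t v) {
        std::size_t bit = i * width, word = bit / 64, off = bit % 64;
        words[word] |= v << off;
        if (off + width > 64)                  // value straddles a word boundary
            words[word + 1] |= v >> (64 - off);
    }
    uint64_t get(std::size_t i) const {
        std::size_t bit = i * width, word = bit / 64, off = bit % 64;
        uint64_t v = words[word] >> off;
        if (off + width > 64)
            v |= words[word + 1] << (64 - off);
        return v & ((width == 64) ? ~0ULL : ((1ULL << width) - 1));
    }
};

int main() {
    // Structure-of-arrays contig table: one packed vector per field, sized
    // to the data (e.g., 18 bits suffice for ~200k transcript ids, 20 bits
    // for offsets up to ~1M), instead of structs with fixed 32-bit fields.
    std::size_t n = 3;
    PackedVec ref_id(n, 18), ref_off(n, 20), orient(n, 1);
    ref_id.set(0, 12345); ref_off.set(0, 987654); orient.set(0, 1);
    std::cout << ref_id.get(0) << " " << ref_off.get(0)
              << " " << orient.get(0) << "\n";  // 12345 987654 1
}
```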

Another practical consideration worth mentioning is the following. Pufferfish is a method to index and query the ccdBG, not to construct it outright. Pufferfish expects as input a GFA format (currently GFA v1) file containing the ccdBG — this has a segment for every contig and a path for every reference sequence, and subsequent contigs in a reference path must overlap by exactly k-1 nucleotides. We rely on other excellent software to generate these GFA files for us. Currently, we’re using TwoPaCo to generate the ccdBG. TwoPaCo adopts an interesting multi-stage and highly-parallel algorithm that constructs the compacted dBG directly (without first constructing the un-compacted dBG). It works very well, and can compute the cdBG for even relatively large reference sequences quickly and in limited memory. Unfortunately, all is not perfect with this pipeline, as TwoPaCo makes a few assumptions (e.g. like this, and this) that prevent us from using its GFA directly. So, we have built a “pufferize” tool that takes the TwoPaCo-format GFA and converts it into a GFA of the appropriate format to be indexed by pufferfish (there are actually some interesting considerations here when transforming the “near cdBG” into the cdBG we expect, but that’s fodder for another post). While this has worked well for all of our testing, it, unfortunately, adds time, resources, and a bit of clunkiness to the whole procedure. So, we’re also looking for more efficient ways to directly get the input we need — a compacted dBG, in GFA format, where each segment is of length at least k, all segments in a path overlap by exactly k-1 nucleotides, and there are no repeated (canonical) k-mers. If you know of any software that efficiently outputs this, please let us know! Finally, we welcome any thoughts or feedback on pufferfish (e.g., ideas for improvements and cool features), and keep a lookout for the pufferfish pre-print, which should be coming very soon!

Edited (07/03/2017) to add some clarifications and fix a few typos.
Edited (07/03/2017, 9:25PM) to update some numbers (puffer got better!).
Edited (07/04/2017, 12:02PM) to update some numbers (puffer got better on the transcriptome!).

References

[1] Running in a single thread on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz.

The RNA-seq abundance zoo

I’m kicking off this year with a rather different kind of post.  Actually, I started writing this post some time ago (November 19, 2013!), if the draft log of WordPress is to be believed.  I’d been working for some time on Sailfish — our tool for transcript quantification — and I wanted to write a post to lay out some of the many different ways that people characterized relative transcript abundance in RNA-seq experiments.  Originally, the plan was for this to be for my own edification as well as to act as a nice little pointer to which people could refer when they were curious about the specifics of a particular measure, or wanted to know how different measures were related.

Needless to say, I never finished that post, and partly gave up on the idea of it since some other nice posts (like Harold Pimentel’s excellent “What the FPKM?” post) came along.  However, I’ve decided to revive this post with a rather new idea in mind — to make it a “living” post.  That is, I want to start out with the basics I planned to write before (i.e., what are the commonly used measures, what do they mean, and what are some of their benefits and shortcomings).  Subsequently, however, I want to use this post as somewhat of a FAQ on transcript abundance measures and related questions.  I find myself answering common questions about relative abundance fairly frequently (be it in the user group for a piece of our software, or on a forum such as BioStars).  It would be nice to point to the answer here, if it exists, or to create it here otherwise.

Quick link to the FAQ

Meeting the players

Estimating gene (or isoform) abundance from RNA-seq is a fairly active area of research and, like any active area, there is a slew of terms and ideas that appear in the literature. However, one of the most basic questions is “how should one report estimates of transcript abundance?” Ideally, one would like an estimate that reflects the quantity we actually want to know (the underlying relative (more on this later) abundance of different transcripts), while remaining intuitive and comparable across experiments.

The literature contains a variety of different metrics that have been devised to serve this purpose.  Below, I list some of the more popular measures:

  • RPKM — Reads Per Kilobase per Million mapped reads
  • FPKM — Fragments Per Kilobase per Million mapped reads
  • TPM — Transcripts Per Million
  • Read Count — Number of reads estimated to have originated from a transcript
  • CPM — Counts Per Million

In reality, RPKM and FPKM are very similar. FPKM just accounts for the fact that fragments, not reads, are what are being drawn from transcripts. Thus, below, I will not distinguish between RPKM and FPKM — rather, I will just refer to *PKM.

Let’s start out with some basics. Our underlying RNA-seq sample consists of a collection of many transcripts. There are M distinct types or species of transcript. We denote by c_i the total number of transcripts of type i. Clearly, the total number of transcripts in our sample is given by \sum_{i=1}^{M} c_i. Now, here’s a crucial point: our RNA-seq experiment will sample fragments from this population of transcripts. The fact that our assay is sampling from a population has a few important implications.

First,  in the absence of sampling biases, we expect to sample fragments from this population such that the number of fragments that we draw from each type of transcript i is proportional to the total number of sequenceable fragments that all transcripts of type i might produce.  This implies that we will see many fragments from the most abundant types of transcripts, and many fewer fragments from less abundant transcripts (that’s good — otherwise it would be difficult to estimate abundance with this data).  However, it also implies that (1) very rare transcript types may not generate any samples at all, dependent upon how deeply we sample and (2) the rate at which we expect to sample fragments from transcripts of a given type depends crucially on the length of these transcripts.  For example, a single 200,000 nucleotide transcript is much more likely to generate fragments than a single 2,000 nucleotide transcript, even though there might be only a single copy of each in our underlying population.  The length effect is inherent in the standard RNA-seq protocol, since we expect to sample fragments randomly across the entire length of underlying transcripts.

Second, the process itself produces relative abundance information.  Given the fragments we generate during our experiment, we can say what the relative proportion of the different transcript types are, but we cannot immediately determine e.g., the actual number, c_i, of transcripts of type i in the population. Moreover, we are not guaranteed to see fragments from everything that exists in our population — low abundance transcripts may go completely unsampled.

Given these considerations, we can now think a bit about how, given the fragments sampled in a specific RNA-seq experiment, we might derive estimators of the underlying abundances in which we’re interested. Consider the following quantities — here, I’m adopting terminology mostly from (Li, Bo, et al. “RNA-Seq gene expression estimation with read mapping uncertainty.” Bioinformatics 26.4 (2010): 493-500.) Let \tau_i = \frac{c_i}{\sum_{j=1}^{M} c_j} be the fraction of transcripts in the underlying population that are of type i. Recall that, since our assumption is that we sample from a transcript type proportionally to the total number of sequenceable nucleotides for transcripts of that type, this is not the fraction of fragments we expect to derive from transcripts of type i. Instead, that fraction is given by \eta_i = \frac{c_i \tilde{\ell}_i}{\sum_{j=1}^{M} c_j \tilde{\ell}_j}. Here, we multiply the count of each underlying type of transcript in the population by its effective length, \tilde{\ell}_i. The effective length is akin to the regular length of each transcript type (\ell_i), except that it accounts for the fact that not every transcript in the population can produce a fragment of every length starting at every position. Actually, a transcript has an effective length with respect to each possible fragment that maps to it. Consider a transcript of length \ell_i and a fragment mapping to this transcript with a length of \ell_f. Then, the effective length of this transcript with respect to this fragment is \ell_i - \ell_f + 1. This is the number of places on transcript i at which a fragment of length \ell_f could start, and the probability of any such position (assuming a uniform starting distribution) is just \frac{1}{\ell_i - \ell_f + 1}.
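For instance (numbers mine, just to make the definition concrete): a transcript of length \ell_i = 1000 and a fragment of length \ell_f = 200 give an effective length of 1000 - 200 + 1 = 801 with respect to that fragment, so each of the 801 admissible start positions has probability \frac{1}{801} under the uniform assumption.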

However, it will generally be convenient to talk about the effective length of a transcript independent of any specific fragment. In that case, what we actually mean is the expected effective length of the transcript (the effective length, in expectation, over all fragments that might derive from it). This can be computed as \tilde{\ell}_i = \ell_i - \mu_d^{\ell_i}, where \mu_d^{\ell_i} is the mean of the empirical fragment length distribution (the observed distribution of fragments in our sample) truncated to consider only fragments of maximum length \ell_i. Given these fundamental quantities of interest, we can now discuss how they relate to the quantities that are often reported by transcript quantification tools.
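Here is a small sketch of that computation, assuming we already have the empirical fragment-length distribution as a probability mass function (a hypothetical helper, not any tool’s actual code):

```cpp
#include <iostream>
#include <vector>

// Expected effective length of a transcript of length len, given the
// empirical fragment-length pmf (pmf[f] = P(fragment length == f)).
// We truncate the distribution at len, renormalize, and subtract the
// truncated mean: eff_len = len - mu_d^{len}.
double effective_length(std::size_t len, const std::vector<double>& pmf) {
    double mass = 0.0, mean = 0.0;
    for (std::size_t f = 1; f < pmf.size() && f <= len; ++f) {
        mass += pmf[f];
        mean += f * pmf[f];
    }
    if (mass == 0.0) return static_cast<double>(len); // no fragment fits
    mean /= mass;  // mean of the distribution truncated at len
    return static_cast<double>(len) - mean;
}

int main() {
    // Toy pmf: fragments of length 150, 200, 250 with probs 0.25, 0.5, 0.25.
    std::vector<double> pmf(251, 0.0);
    pmf[150] = 0.25; pmf[200] = 0.5; pmf[250] = 0.25;
    std::cout << effective_length(1000, pmf) << "\n"; // 1000 - 200 = 800
    std::cout << effective_length(180, pmf) << "\n";  // only 150-mers fit: 30
}
```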

TPM and *PKM

The TPM metric is an estimator of \tau_i.  That is:

\textrm{TPM}_i = 10^6 \cdot \left( \frac{\frac{X_i}{\tilde{\ell_i}}}{\sum_{j=1}^{M} \frac{X_j}{\tilde{\ell_j}}}\right) \propto \tau_i

where X_i is our estimate of the expected number of fragments deriving from transcripts of type i (determining this quantity is one of the fundamental challenges in RNA-seq quantification, and is the raison d’être for transcript quantification methods). It’s also the case that *PKM is proportional to TPM and, therefore, to the transcript fractions. However, TPM is to be strictly preferred to *PKM, since *PKM contains an essentially arbitrary scaling factor (which is dependent on the average effective length of the transcripts in the underlying sample — see, e.g., this blog post). Though you shouldn’t use it, I’ll include the FPKM formula here for the sake of completeness:

\textrm{FPKM}_i = 10^9 \cdot \frac{X_i}{\left(N \tilde{\ell}_i\right)}

where N is the number of mapped or aligned reads in the sample.
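Putting the two formulas into code (a hypothetical helper over estimated counts X_i and expected effective lengths; note that N is simply taken to be the sum of the X_i here):

```cpp
#include <iostream>
#include <vector>

// Compute TPM and FPKM for each transcript from estimated fragment counts
// X[i] and expected effective lengths eff_len[i], following the formulas
// above. N (total mapped fragments) is taken as the sum of the X[i].
void tpm_fpkm(const std::vector<double>& X, const std::vector<double>& eff_len,
              std::vector<double>& tpm, std::vector<double>& fpkm) {
    std::size_t M = X.size();
    tpm.assign(M, 0.0);
    fpkm.assign(M, 0.0);
    double N = 0.0, denom = 0.0;
    for (std::size_t i = 0; i < M; ++i) {
        N += X[i];
        denom += X[i] / eff_len[i];  // sum_j X_j / eff_len_j
    }
    for (std::size_t i = 0; i < M; ++i) {
        tpm[i] = 1e6 * (X[i] / eff_len[i]) / denom;  // TPM_i
        fpkm[i] = 1e9 * X[i] / (N * eff_len[i]);     // FPKM_i
    }
}

int main() {
    // Two transcripts at the same underlying abundance, one twice as long:
    // the longer one draws ~twice the fragments, but TPM comes out equal.
    std::vector<double> X = {200.0, 100.0}, el = {2000.0, 1000.0}, tpm, fpkm;
    tpm_fpkm(X, el, tpm, fpkm);
    std::cout << tpm[0] << " " << tpm[1] << "\n";   // 500000 500000
    std::cout << fpkm[0] << " " << fpkm[1] << "\n"; // ~333333 ~333333
}
```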

Read Count

So, what about the raw (or estimated) read count? This is a very easy metric to interpret. Also, it is clearly related to the other metrics (e.g., it is the X_i in the TPM equation above). However, as a measure of relative transcript abundance, it has two particular shortcomings. First, it is not normalized within a sample, and so a read count of 100 in sample A may have a very different meaning from a read count of 100 in sample B (what if sample B has exactly the same composition as sample A, but is sequenced to twice the depth?). Of course, relative abundance measures cannot, in general, be directly compared across samples. However, the raw read counts are even “less” normalized than other standard metrics. Second, this metric is not normalized by length. Recall how I mentioned above that we expect, a priori, longer transcripts present in the same numbers to produce more sequenceable fragments than shorter transcripts. The read count, as a metric, doesn’t take this into account. So, transcripts present at the same underlying abundance may have drastically different read counts, even within the same sample. While not a good metric for directly comparing or measuring the abundance of transcripts, the read count (or estimated read count) is the basis of many derived metrics that attempt to normalize for differences between samples. This is, in part, because the read count can be judged relative to the sequencing depth (the number of reads in the sample), and therefore one can derive reasonable scaling factors for this metric across samples.

Counts Per Million

The counts per million (CPM) metric takes the raw (or estimated) counts and performs the first type of normalization I mentioned in the previous section. That is, it normalizes the count by the library size, and then multiplies it by a million (to avoid scary small numbers). In equations:

\textrm{CPM}_i = 10^6 \cdot \frac{X_i}{N}

Interestingly, the CPM metric is directly related to the nucleotide fraction \eta_i mentioned earlier in this post.  It essentially measures the rate at which we expect to sample reads from a particular type of transcript.

Frequently Asked Questions (FAQs)

Note: This section is starting out barebones, but I expect it to be updated early and often.

Can I compare TPM / *PKM / CPM across samples?

It depends what you mean by “compare”.  Because these measures are purely relative, you cannot reliably use a metric like TPM to directly assess differences in transcript abundance across samples.  Purely relative metrics are not suited for comparison across samples.  Specifically, changes in the composition of a sample can lead to changes in TPM, even if transcripts are, in reality, expressed at the same true level across samples.  Metrics like this can be useful for “high-level” comparisons (e.g., visualizing samples etc.).  However, whenever using a relative metric like this, one should be aware of its relative nature and the potential caveats that go along with interpreting them between samples.

Does TPM account for “fragments”?

Yes.  That is, a proper implementation of the TPM metric will account for fragments in the way that FPKM does in comparison to RPKM.  Paired-end reads are treated as a single fragment, while in single-end libraries, each read is treated as a single fragment.  I have answered this question in depth (with explanation) on Biostars, here.

What is the “effective length” of a transcript?

See the description earlier in this post.


Please feed the bears: Using Salmon, Sailfish and Sleuth for differential expression