Lecture 15: RNA-seq - Quantification and the EM algorithm
Wednesday, May 18, 2016
Scribed by Paraskevas Deligiannis and revised by the course staff
In the previous lecture, we introduced RNA-seq and the quantification problem. In this lecture, we dive deeper into this problem – in the case when we know the RNA transcripts beforehand – and examine how we can solve it optimally under some simplicity assumptions.
- The RNA quantification problem & our assumptions
- Naive split reads
- Improved approach: Iterative estimation refinement
- Maximum Likelihood estimation
- Expectation-Maximization (EM) algorithm
As discussed in the last lecture, the RNA-seq data consist of multiple reads sampled from the various RNA transcripts in a given tissue (after the reverse transcription of RNA to cDNA). We assume that these reads are short in comparison to the RNA transcripts and have equal length .
We, also, assume that we know the list of all possible RNA transcripts beforehand. Every read is mapped (using alignment) to (possibly multiple) transcripts. Our goal is to estimate the abundance of each transcript where denotes the fraction of among all transcripts.
Additionally, we make the following assumptions for the sake of simplicity:
- All transcripts have equal length . It is fairly straightforward to extend our analysis to transcripts of unequal length.
- The reads come from uniformly sampling all the transcripts. This is a relatively mild assumption we have made before to ease our analysis, even though it is not entirely accurate.
- Each read can come from at most one location on each transcript. This is a reasonable assumption, since different exons rarely contain common subsequences.
As discussed in the previous lecture, the difficulty of the above counting problem lies in the existence of multiply-mapped reads. This occurs due to the fact that different transcripts can contain the same exons.
One approach to deal with these reads would be to split them, i.e. assign a fractional count to each transcript a read maps to. The simplest way to do this is by splitting the read equally among all transcripts it is mapped to. For instance, if a read maps to 5 transcripts, then each of them gets a count of .
Finally, our estimate of the abundance of with total count is
While naive read splitting sounds reasonable, it can fail spectacularly as illustrated by the following example.
Here, we have two transcripts sharing a common exon B and our ground truth is that and . As a result, some of our reads come from exon A and some come from exon B. Let’s assume that half of the reads come form each exon (even though the figure above does not depict the two exons as of equal length).
All the reads coming from exon A map uniquely to and thus, they contribute a total of to the count of . All the reads coming from exon B map to both transcripts and according to the naive algorithm above, each of them contributes to each of the counts As a result, our estimate is that
which is clearly wrong.
Since the naive algorithm fails, we need to come up with a better solution. We note that despite the failure, we came closer to the truth (in comparison to a random guess of equal abundances). Is there a way to leverage the information we gained?
What if we were to use our newly obtained estimate of abundances to re-split the multiply-mapped reads with weights proportional to the relative abundances of the two transcripts? In our example, this would mean that
which is closer to the truth.
But now, we can simply repeat the process using the updated estimate of the abundances. It is easy to see that at each step, will be halved and hence, this process converges to the ground truth at an exponential rate.
This seems promising. So, let’s formulate this algorithm more generally.
Since we know nothing about the abundances to begin with, our initial estimate is uniform. That is
For step repeat:
- For let read map to to a set of transcripts, denoted by .
Then, split into fractional counts for each transcript , equal to the relative abundances of the transcripts in ,
- The updated estimate is, obviously,
- For let read map to to a set of transcripts, denoted by . Then, split into fractional counts for each transcript , equal to the relative abundances of the transcripts in , as follows:
The natural questions that arise now are
- whether this algorithm converges in general and
- even if it does, does it converge to the Maximum Likelihood (ML) estimate ?
The former is crucial, since a non-converging process is, basically, useless. The latter examines the performance of the algorithm in terms of estimation accuracy. Given that we have a finite amount of data, we cannot guarantee that the ground truth is recovered. Our best hope is to find .
This leads us to our next few topics. What is and how can we compute it efficiently?
Let be the vector of transcript abundances. The distribution of the reads depends on . Since the reads are sampled independently, we have that
We note that if and only if the read maps to a transcript that contains the sequence and starts from the same position as . (Recall our assumption that each read can come from at most one location on each transcript.) Therefore, the probability
The reads are sampled uniformly, so . Also, each transcript has length and each read of length L coming from has an equal chance of starting at each of the possible positions, hence
As a result,
which means that
Maximizing the likelihood function is equivalent to maximizing the log-likelihood, and since is a constant, the ML estimation for is
If we define the matrix with elements , then
Along each row of matrix the positions with value 1 represent the transcripts to which maps. Along each column of , the positions with value 1 represent the reads that map to .
As we can see, the effect of the data (reads) on the ML estimate can be summarized by matrix , or equivalently, its entries . Therefore, is a sufficient statistic for this ML inference problem.
The above maximization problem is a convex optimization problem, which is desired, since convex problems are well studied and there exist tools to solve them efficiently.
The convexity arises from the combination of the following factors:
it involves the maximization of a concave objective function, as
is an affine function of
the composition of a concave with an affine function is concave
the sum of concave functions is concave
both and are linear constraints
Let’s consider the special case when there are no multiply-mapped reads, i.e. every read maps to at most one transcript, or equivalently, each row of matrix has at most one element with value 1. If we denote by the number of reads that map to transcript , then the above maximization problem becomes
This problem is easy to solve by equating to 0 the gradient of the objective function w.r.t and using Lagrange multipliers. The result is obvious and reassuring:
After all, we knew that the naive approach with split reads works when all reads are uniquely mapped. Note that when all reads map to exactly one transcript, .
The EM algorithm was thoroughly described in a 1977 article written by Arthur Dempster, Nan Laird, and Donald Rubin. It provides an iterative method to compute – locally optimal – ML parameters of a statistical model. If denotes the unknown parameters to be estimated (in our case, ) and are the observed variables (data, in our case, the reads), then the EM algorithm computes
The basic observation is that in many cases of statistical inference, the problem would become significantly easier, or even trivial, if we knew some unobserved hidden variable(s) . For instance, in our case, if we somehow knew for every read which transcript it came from, then our problem would reduce to the simple problem with uniquely mapped reads. This information can be represented by a matrix such that
The idea behind EM is to iteratively try to use and the current estimate to compute the posterior distribution: . Then we update our estimate to maximize the expected value of the likelihood w.r.t. and so on:
It is proven that the EM algorithm always improves the likelihood at every iteration. Since is bounded above by 1, this means that the EM algorithm converges to a local maximum.
However, we saw from our ML analysis in a previous section that the ML optimization problem is concave. This means that there is only one local maximum, the global one. Thus, the EM algorithm is guaranteed to find the optimal ML estimate.
Let’s formally describe the EM algorithm. The algorithm consists of two successive steps, the expectation step (E step) and the maximization step (M step), after which it is named. These steps are repeated until the algorithm converges.
E step: Compute the expected value of the log-likelihood function w.r.t the conditional distribution of using the current estimate of the unknown parameters :
M step: Find the updated estimate that maximizes :
In our case, it is easy to see that the probability that a read comes from the transcript is proportional to the relative abundance of in . Namely, given our reads and our current estimate ,
We have that
where in the first step, we note that since represents the ground truth, there is at most one non-zero term in the sum (similar to the case with uniquely mapped reads). We drop the indicator function in the second step assuming that and Z are consistant.
where we have omitted the linear constraints on for simplicity. One can verify that if we define , the above results follow readily from our analysis for the case of uniquely mapped reads. This is, after all, the reason why we use the hidden variable .
The EM algorithm described above is a type of soft EM in the sense that at each iteration, it calculates and uses only an estimate of the distribution of given .
In contrast, a hard EM algorithm would compute an estimate of the values of and then use it to assign each read to the transcript that is most abundant according to the current estimate . The hard algorithm commits a lot more to its estimates than the “soft” alternative but seems simpler to implement.
Given its possible benefits over the soft version, the questions that arise are:
- Will it work for the example of section shortcomings of naive splitting where there is a common exon in two different transcripts?
- Will it work in general?