Lecture 13: Haplotype Assembly - Community Detection
Wednesday 11 May 2016
Scribed by Christian Choe and revised by the course staff
In the last lecture we introduced the haplotype assembly problem. By casting the problem as a convolutional coding problem, we can use Viterbi decoding to arrive at a solution. This approach suffers from an exponential runtime in terms of the number of SNPs between mate-pairs. In this lecture, we will take a different approach based on the community recovery problem.
- Community recovery problem
- Spectral method
- Simplifying the haplotype phasing problem
In the community recovery problem, we are given a graph with a bunch of nodes, and each node belongs to one of multiple clusters as shown in the figure below. The recovery problem is to recover the clusters (colors) based on the edge information between nodes. This problem is commonly seen in social networks where nodes can be blog posts, for example, and we want to identify which posts are from Republicans and which are from Democrats. The edges describe how the posts link to one another.
When sequencing a diploid organism, we obtain a set of heterozygous SNPs such as the one shown in the figure below. ‘0’ and ‘1’ represent the two alleles at each SNP.
If we represented this as a graph, we would have four nodes corresponding to the 4 heterozygous SNP locations. We can also define two communities and for our graph:
Nodes and belong to community , and node belongs to community . In practice, we may have 100,000 nodes with half in each cluster. Notice that recovering the communities is equivalent to solving the haplotype phasing problem. The partition will give us all the information except for 1 bit: which SNPs correspond to the maternal chromosome.
We will let $x_i$ denote the class of node $i$, with $x_i \in \{-1, +1\}$. Let $Y_{ij}$ represent the edge data between nodes $i$ and $j$, where $Y_{ij} = +1$ if $x_i = x_j$ and $-1$ otherwise. We will set $Y_{ij} = 0$ if there are no linking reads between node $i$ and node $j$. If there is no noise, then $Y_{ij} = +1$ corresponds to an edge whose two nodes are in the same community, and $Y_{ij} = -1$ indicates that the two nodes are in different communities. The figure below illustrates the notation introduced so far.
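When working with real read data, we can think of each measurement (mate-pair read) as a noisy edge on the graph telling us whether two nodes are in the same community. We introduce a random variable $Z_{ij}$ to represent the noise in each edge. We assume that all $Z_{ij}$ are i.i.d. In summary, for each observed edge,

$$Y_{ij} = x_i x_j Z_{ij}, \qquad Z_{ij} = \begin{cases} +1 & \text{with probability } 1-\epsilon \\ -1 & \text{with probability } \epsilon \end{cases}$$

$Y_{ij}$ essentially tells us whether two nodes are in the same community, with probability of error $\epsilon$. Exploiting vector notation, we further define:

$$x = [x_1, x_2, \dots, x_n]^T, \qquad Y = [Y_{ij}]_{i,j=1}^{n}$$

where $n$ is the number of nodes.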
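The measurement model above can be simulated in a few lines. The following is a minimal Python sketch, assuming labels in $\{-1,+1\}$ and measurements $Y_{ij} = x_i x_j Z_{ij}$, where $Z_{ij}$ flips the sign with probability `eps`. The function name `simulate_measurements`, the probability `p` that a pair of nodes is linked at all, and the `seed` argument are illustrative assumptions rather than part of the lecture.

```python
import random

def simulate_measurements(x, p, eps, seed=0):
    """Return a symmetric matrix Y with Y[i][j] in {-1, 0, +1}.

    x    : list of labels, each -1 or +1
    p    : (assumed) probability that a pair (i, j) is linked by a read
    eps  : probability that an observed edge reports the wrong sign
    """
    rng = random.Random(seed)
    n = len(x)
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:                      # pair is observed
                z = -1 if rng.random() < eps else 1   # noise variable Z_ij
                Y[i][j] = Y[j][i] = x[i] * x[j] * z
    return Y

# Example: four SNPs, half of the pairs observed on average, 10% edge errors.
Y = simulate_measurements([1, -1, 1, -1], p=0.5, eps=0.1, seed=0)
```

Unobserved pairs and the diagonal stay at 0, matching the convention that $Y_{ij} = 0$ when no read links the two nodes.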
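To solve the community recovery problem, we will use maximum likelihood to infer the $x_i$'s:

$$\hat{x} = \arg\max_{x \in \{-1,+1\}^n} \prod_{(i,j) \in E} P(Y_{ij} \mid x_i, x_j)$$

where $E$ indicates the set of edges in the observed data. This can be further simplified by using the log likelihood:

$$\hat{x} = \arg\max_{x \in \{-1,+1\}^n} \sum_{(i,j) \in E} \log P(Y_{ij} \mid x_i, x_j)$$

where, with $\epsilon$ the per-edge error probability, each log term can be expressed as

$$\log P(Y_{ij} \mid x_i, x_j) = \begin{cases} \log(1-\epsilon) & \text{if } Y_{ij} = x_i x_j \\ \log \epsilon & \text{if } Y_{ij} = -x_i x_j \end{cases} \;=\; \frac{1}{2}\log\big(\epsilon(1-\epsilon)\big) + \frac{1}{2}\, Y_{ij} x_i x_j \log\frac{1-\epsilon}{\epsilon}$$

Since $\epsilon$ is a constant (and $\epsilon < 1/2$, so that $\log\frac{1-\epsilon}{\epsilon} > 0$), we can further simplify the ML decoder to

$$\hat{x} = \arg\max_{x \in \{-1,+1\}^n} \sum_{i,j} Y_{ij} x_i x_j = \arg\max_{x \in \{-1,+1\}^n} x^T Y x$$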
Notice that the sum now also includes pairs that are not in the observed set of edges. We can set $Y_{ij} = 0$ for these pairs so that they contribute nothing. Ultimately, we want to maximize this quadratic form. Brute-force maximization is impractical because the number of possible $x$'s is $2^n$.
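To make the cost of brute force concrete, here is a small sketch of the exhaustive decoder, assuming the objective is the sum of $Y_{ij} x_i x_j$ over pairs and labels are in $\{-1,+1\}$. The names `objective` and `brute_force_ml` are hypothetical, and fixing the first label to $+1$ is a choice made here to remove the global sign ambiguity. The loop visits $2^{n-1}$ labelings, so it is only usable for very small $n$.

```python
from itertools import product

def objective(Y, x):
    """Sum of Y[i][j] * x[i] * x[j] over unordered pairs i < j."""
    n = len(x)
    return sum(Y[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def brute_force_ml(Y):
    """Exhaustive search over labelings with x[0] fixed to +1."""
    n = len(Y)
    best_x, best_val = None, None
    for rest in product((1, -1), repeat=n - 1):
        x = [1, *rest]
        val = objective(Y, x)
        if best_val is None or val > best_val:
            best_x, best_val = x, val
    return best_x

# Noiseless, fully observed toy instance with five nodes.
truth = [1, -1, 1, 1, -1]
Y = [[0 if i == j else truth[i] * truth[j] for j in range(5)] for i in range(5)]
estimate = brute_force_ml(Y)
```

On this noiseless toy instance the search recovers `truth`, but each additional node doubles the running time.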
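Intuitively, when two nodes are in the same community ($x_i x_j = 1$) and there is no error ($Y_{ij} = 1$), $Y_{ij} x_i x_j$ is positive, giving us a positive contribution to our maximization objective. Terms where the labeling disagrees with the measurement are negative, and we want as few of those as possible. We can decompose the sum in the objective into:

$$\sum_{(i,j) \in E} Y_{ij} x_i x_j = \sum_{(i,j) \in E:\; x_i = x_j} Y_{ij} \;-\; \sum_{(i,j) \in E:\; x_i \neq x_j} Y_{ij}$$

This is a combinatorial optimization problem. Suppose we are solving a simpler problem where we only have $-1$ edges, that is, $Y_{ij} = -1$ for every observed edge. Then the objective becomes

$$-(\#\text{ within-cluster edges}) + (\#\text{ cross edges}) = -(\#\text{ edges}) + 2\,(\#\text{ cross edges})$$

While the number of edges is fixed, the number of cross edges depends on the clustering. Therefore the problem becomes: find a partition of the graph that maximizes the number of cross edges. This is the max cut problem, which is NP-hard. Without further assumptions, the general problem is therefore NP-hard as well, so we will need to exploit some further structure in the problem.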
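The reduction to max cut can be checked numerically on a toy graph. The sketch below assumes every observed edge carries $Y_{ij} = -1$ and compares the objective with $-(\#\text{edges}) + 2(\#\text{cross edges})$. The edge list and the helper names `edge_objective` and `cut_size` are made up for illustration.

```python
def edge_objective(edges, x):
    """Objective sum of Y_ij * x_i * x_j when every observed edge has Y_ij = -1."""
    return sum(-1 * x[i] * x[j] for i, j in edges)

def cut_size(edges, x):
    """Number of edges whose endpoints carry different labels (cross edges)."""
    return sum(1 for i, j in edges if x[i] != x[j])

edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
x = [1, -1, 1, -1]
lhs = edge_objective(edges, x)                # direct evaluation of the objective
rhs = -len(edges) + 2 * cut_size(edges, x)    # -(# edges) + 2 (# cross edges)
```

Here four of the five edges are cut, so both sides equal 3. Because the number of edges is fixed, maximizing the objective is the same as maximizing the cut.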
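In order to solve this NP-hard combinatorial optimization problem, we can use the spectral method to arrive at an approximate solution. We relax the problem by allowing each $x_i$ to be real. We will also constrain $\|x\|^2 = n$, which every $\pm 1$ vector satisfies. We can bound the optimization problem as follows:

$$\max_{x \in \{-1,+1\}^n} x^T Y x \;\le\; \max_{x \in \mathbb{R}^n:\ \|x\|^2 = n} x^T Y x \;=\; n\,\lambda_{\max}(Y)$$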
Because $Y$ is a symmetric matrix, its eigenvalues are real. The relaxed problem is maximized by the eigenvector corresponding to the largest eigenvalue of $Y$ (scaled to have squared norm $n$), so we simply set $x$ equal to this eigenvector. By taking the sign of each entry of $x$, we get an approximate solution to our original problem (where $x_i \in \{-1,+1\}$). This approach is called the spectral method because we pick $x$ according to the spectrum of $Y$.
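The following is a minimal pure-Python sketch of the spectral method, assuming $Y$ is a symmetric matrix with entries in $\{-1, 0, +1\}$. The lecture does not say how the eigenvector is computed. This sketch uses shifted power iteration as a stand-in for a library eigensolver, and the names `top_eigenvector` and `spectral_communities` are hypothetical. The shift makes all eigenvalues positive, so the iteration converges to the eigenvector of the largest eigenvalue rather than the largest in magnitude.

```python
import math

def top_eigenvector(Y, iters=500):
    """Approximate the eigenvector of the largest eigenvalue of symmetric Y."""
    n = len(Y)
    # Gershgorin shift: every eigenvalue of Y + c*I is at least 1.
    c = 1 + max(sum(abs(v) for v in row) for row in Y)
    v = [1.0 + 0.01 * i for i in range(n)]   # arbitrary non-constant start vector
    for _ in range(iters):
        w = [c * v[i] + sum(Y[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v

def spectral_communities(Y):
    """Round the top eigenvector to +/-1 labels, with node 0 labeled +1."""
    v = top_eigenvector(Y)
    labels = [1 if t >= 0 else -1 for t in v]
    if labels[0] < 0:                        # resolve the global sign ambiguity
        labels = [-t for t in labels]
    return labels

# Noiseless, fully observed toy instance.
truth = [1, 1, -1, -1, 1, -1]
Y = [[0 if i == j else truth[i] * truth[j] for j in range(6)] for i in range(6)]
labels = spectral_communities(Y)
```

Pinning node 0 to $+1$ reflects the one bit that the partition cannot determine. For real data sizes a sparse eigensolver from a numerical library would be the more natural choice.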
Because we relaxed the original problem, we need to exhibit some evidence that this approach is good. Consider the following random graph: for every pair of nodes, we draw an edge between them with probability $p$. Note that $Y$ is a random matrix because the locations of the measurements are random and the errors are random. We want to first check what happens when $Y$ is replaced by its expected value. Intuitively, if this method does not work when $Y$ is deterministic at the mean, then there’s not much hope of this method working in general. With $\epsilon$ the per-edge error probability,

$$E[Y_{ij}] = p\,(1 - 2\epsilon)\,x_i x_j \quad (i \neq j), \qquad \text{so, ignoring the diagonal,} \quad E[Y] = p\,(1 - 2\epsilon)\,x x^T$$

$E[Y]$ is a rank-1 matrix, and applying the spectral method to this matrix will give us exactly $x$, our ground truth (up to a global sign). The hope is that while in actuality $Y$ is random, statistically it’s close to its mean $E[Y]$. This suggests that, using the spectral method, we can expect to get a reasonable answer.
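As a short check of the rank-1 claim, assume the model $Y_{ij} = x_i x_j Z_{ij}$ with each pair observed with probability $p$, $P(Z_{ij} = -1) = \epsilon$, and the diagonal ignored so that $E[Y] = p(1-2\epsilon)\,x x^T$. Multiplying by $x$ and using $x^T x = n$ gives

$$E[Y]\,x = p\,(1 - 2\epsilon)\,x\,(x^T x) = n\,p\,(1 - 2\epsilon)\,x$$

so $x$ is an eigenvector with eigenvalue $n\,p\,(1-2\epsilon)$, which is positive when $\epsilon < 1/2$, while every vector orthogonal to $x$ has eigenvalue $0$. The top eigenvector of $E[Y]$ is therefore proportional to $x$, and taking signs recovers the communities up to a global flip.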
The solution obtained for $x$ using the spectral method will be correct in a large fraction of its entries. We can clean up the entries a bit by considering the neighbors of each node. We set each node to the majority community amongst its neighbors. Since most of the nodes are correct, this clean-up step improves our solution.
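One way to sketch this clean-up step, under an interpretation that is an assumption here: since edges are signed, each neighbor $j$ of node $i$ votes $Y_{ij} x_j$ for the label of $i$ (a $+1$ edge votes for the neighbor's label, a $-1$ edge for the opposite label), and node $i$ takes the sign of the total vote. The name `majority_cleanup` is hypothetical, and ties keep the current label.

```python
def majority_cleanup(Y, labels):
    """One round of neighbor voting; all votes use the labels passed in."""
    n = len(labels)
    cleaned = []
    for i in range(n):
        vote = sum(Y[i][j] * labels[j] for j in range(n) if j != i)
        if vote > 0:
            cleaned.append(1)
        elif vote < 0:
            cleaned.append(-1)
        else:
            cleaned.append(labels[i])   # tie: leave the label unchanged
    return cleaned

# Toy example: noiseless complete measurements, one wrong label (node 2).
truth = [1, 1, 1, -1, -1, -1]
Y = [[0 if i == j else truth[i] * truth[j] for j in range(6)] for i in range(6)]
noisy_labels = [1, 1, -1, -1, -1, -1]
cleaned = majority_cleanup(Y, noisy_labels)
```

In this toy case the single wrong label is outvoted by its five neighbors, and the other nodes keep their labels.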
When dealing with real heterozygous SNPs, the linking information will be constrained to ranges of ~100 kbp (e.g. 10x technologies), while the chromosomes they reside on are each ~100 Mbp. Since the links are localized, unlike in a random graph, we can section the chromosome into segments of a fixed length and analyze each segment as shown in the figure below. When the segments are short we can use Viterbi decoding, and when they are long we can use the spectral method.