Lecture 11: Alignment - Dynamic Programming and Indexing
Monday 25 April 2016
Scribed by the course staff
In the last lecture, we introduced the alignment problem where we want to compute the overlap between two strings. Today we will talk about a dynamic programming approach to computing the overlap between two strings and various methods of indexing a long genome to speed up this computation.
- Review of alignment
- Dynamic programming
- Genome indexing
When working with reads, we are generally interested in two types of alignment problems.
- Reference-based SNP/variant calling, where reads are aligned to a reference. We are often interested in computing the alignment for a billion reads to a long reference genome.
- De novo assembly, where reads are aligned to each other.
To compute the optimal alignment between two genomic sequences (or, more generally, strings), we can find the minimal edit distance between the two sequences. We note that for both of the above problems, a lot of computation is repeated using the same data. As a recap, the figure below finds the minimal edit distance alignment between two strings $X$ = ‘GCGTATGTG’ and $Y$ = ‘GCTATGCG’. Recall that for the standard edit distance problem, we assign substitutions, deletions, and insertions equal penalties.
To go from X to Y, we need at least one deletion and one substitution. Therefore the edit distance is 2.
Dynamic programming is the strategy of reducing a bigger problem to multiple smaller problems such that solving the smaller problems results in a solution to the bigger problem. First, we need to define the “size” of a problem. For edit distance, we let $(i, j)$ represent the problem of computing the edit distance between $X^i$ and $Y^j$, where $X^i$ is the length-$i$ prefix of the string $X$ and $Y^j$ is the length-$j$ prefix of the string $Y$. If we let $m$ represent the length of $X$ and $n$ represent the length of $Y$, then the edit distance between $X$ and $Y$ is the solution to problem $(m, n)$. The claim is: if we can solve all the problems $(i, j)$ for $i \le m$ and $j \le n$ other than $(m, n)$ itself, then we will efficiently obtain a solution for problem $(m, n)$.
Let $D(i, j)$ equal the edit distance between $X^i$ and $Y^j$. Suppose we are looking at $X^3$ = ‘GCG’ and $Y^2$ = ‘GC’. The edit distance between these two is 1. To express this in terms of even smaller problems, we need a key insight: problem $(i, j)$ can be solved directly using the solutions to 3 subproblems:
- $(i-1, j)$, where we advance $X$ by one character and put an empty symbol in $Y$.
- $(i, j-1)$, where we advance $Y$ by one character and put an empty symbol in $X$.
- $(i-1, j-1)$, where we advance both $X$ and $Y$ by 1 character.
Since we are interested in the minimal edit distance,
$$D(i, j) = \min \{ D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + \mathbb{1}\{X_i \neq Y_j\} \},$$
where $\mathbb{1}\{\cdot\}$ represents an indicator function.
Thus $\mathbb{1}\{X_i \neq Y_j\}$ is 0 if the $i$th character of $X$ is the same as the $j$th character of $Y$, and 1 otherwise. The base cases are $D(i, 0) = i$ and $D(0, j) = j$, since aligning a prefix to the empty string costs one insertion or deletion per character.
We can think of solving this problem as filling in the entries of a table where the columns correspond to the empty string plus the characters in $X$, and the rows correspond to the empty string plus the characters in $Y$. Please see the figure below for the filled-out table corresponding to the first example in this lecture.
Note that the $(m, n)$th entry (bottom-right corner) indeed holds the minimal edit distance between the two strings. After filling out the dynamic programming table, we can trace a path back from the bottom-right entry to the top-left entry along the smallest values to obtain the alignment shown in the first figure in this lecture.
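As a minimal sketch of the table-filling and traceback steps, assuming unit penalties as above, the following Python code computes the table and recovers one optimal alignment. The function names `edit_distance_table` and `traceback` are our own, and the list-of-lists `D[i][j]` is indexed by the prefix lengths of the first and second string, which may be the transpose of the drawn table.

```python
def edit_distance_table(x, y):
    """Return D, where D[i][j] is the edit distance between the
    length-i prefix of x and the length-j prefix of y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # delete i characters of x
    for j in range(1, n + 1):
        D[0][j] = j                      # insert j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,             # gap in y
                          D[i][j - 1] + 1,             # gap in x
                          D[i - 1][j - 1] + mismatch)  # match or substitution
    return D


def traceback(x, y, D):
    """Walk from the bottom-right entry to the top-left entry and return
    one optimal alignment as two equal-length strings ('-' marks a gap)."""
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


X, Y = "GCGTATGTG", "GCTATGCG"
D = edit_distance_table(X, Y)
print(D[len(X)][len(Y)])   # 2
print(*traceback(X, Y, D), sep="\n")
```

When several alignments are optimal, this traceback prefers the diagonal move, so the alignment it prints may differ from the one in the figure while having the same cost.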
Each computation requires looking up at most 3 entries of the table. Therefore the complexity of this algorithm is $O(mn)$. For read-overlap graph assembly, we have $N$ reads each of length $L$. Using this edit distance approach, we will need $O(N^2 L^2)$ operations to perform assembly. With $N$ possibly on the order of $10^8$ or $10^9$ for a sequencing experiment, this operation is quite expensive.
For variant calling, we want to align $N$ reads to a reference genome of length $G$. Computing the edit distance between a read and the genome is $O(LG)$. The runtime is $O(NLG) = O(CG^2)$, where $C = NL/G$ is the coverage depth. $G$ can be large ($3 \times 10^9$ for the human genome) and therefore this operation is also quite expensive.
The genome is very long, so alignment algorithms that require a search along the entire genome (for every read) seem suboptimal. We can use an index to store the context of the genome to make the search more efficient, reducing the amortized cost.
We first consider the idealized case. Suppose each read is error-free, so the reads come directly from the genome. Every read has a location in this genome, and we want to find this location fast. We can build a sorted list of $L$-mers (each read has length $L$) such that we can easily look up the genome indices of specific $L$-mers. We build the list by extracting all length-$L$ sequences from the genome and sorting them in lexicographical order. For each new read coming in, we can quickly search for the corresponding $L$-mer key in the list, obtaining its location in the genome.
Since the list is at most length $G$, doing a binary search on the list is $O(\log G)$. Note that the complexity of building the index is $O(G \log G)$, but we only have to build this index once. Looking up $N$ reads is $O(N \log G)$. Overall, the total cost is $O((N + G) \log G)$, much better than $O(NLG)$.
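The sorted-list index for error-free reads can be sketched as follows. This is an illustrative toy under our own assumptions: the names `build_sorted_index` and `lookup` are hypothetical, the genome is a short made-up string, and the list stores every $L$-mer explicitly, which would be far too memory-hungry for a real genome.

```python
from bisect import bisect_left


def build_sorted_index(genome, L):
    """Return a lexicographically sorted list of (L-mer, start position) pairs."""
    return sorted((genome[p:p + L], p) for p in range(len(genome) - L + 1))


def lookup(index, read):
    """Binary-search the sorted list and return every genome position
    whose L-mer equals the (error-free) read."""
    lo = bisect_left(index, (read, -1))   # first entry with key >= read
    positions = []
    while lo < len(index) and index[lo][0] == read:
        positions.append(index[lo][1])
        lo += 1
    return positions


genome = "ACGTACGTTGCA"           # toy genome, for illustration only
index = build_sorted_index(genome, 4)
print(lookup(index, "ACGT"))      # [0, 4]
print(lookup(index, "GTTG"))      # [6]
print(lookup(index, "AAAA"))      # []
```

The index is built once and reused for every read, which is where the amortized savings come from.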
Rather than using a sorted list, we could also use a hash table. This results in $O(1)$ lookup for new reads, but keeping the entire hash table in memory is more expensive.
We now look at the regime where the error rate is relatively low (e.g. less than 1%, in the case of Illumina reads). Recall from last lecture that when the error rate is low, we typically get 0-2 errors per length-100 read. We will see many length-20 subsequences which are error-free. So instead of making a sorted list of $L$-mers, we can create a sorted list of 20-mers. With this strategy, 20-mers with errors will map to a random location or no location at all. This is not an issue, however, because if 20 is a reasonably large number, then most random 20-mers will map to no location in the genome.
If the error rate is high, we can use the same approach as in the low-error case, reducing the length $k$ of the $k$-mers used for indexing. If the error rate is high enough (15% in the case of PacBio reads), $k$ will need to be small to ensure that a read still contains error-free $k$-mers with high probability. With a smaller $k$, however, we also expect more collisions in our table, since the set of distinct $k$-mers is smaller. More collisions result in a more expensive alignment procedure.
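The trade-off in choosing $k$ can be made concrete with a back-of-the-envelope calculation. The sketch below rests on two simplifying assumptions of our own: errors hit each base independently, and the genome behaves like a uniformly random string over 4 letters (which ignores repeats). The function names are hypothetical.

```python
def prob_kmer_error_free(k, error_rate):
    """Probability that a given k-mer of a read contains no error,
    assuming independent per-base errors."""
    return (1.0 - error_rate) ** k


def expected_random_hits(k, genome_length):
    """Expected number of positions at which an unrelated k-mer matches a
    uniformly random genome: (number of positions) / 4^k."""
    return (genome_length - k + 1) / 4 ** k


G = 3 * 10 ** 9                      # roughly the human genome length
for error_rate in (0.01, 0.15):      # low-error and high-error regimes
    for k in (20, 10):
        print(error_rate, k,
              prob_kmer_error_free(k, error_rate),
              expected_random_hits(k, G))
```

Under these assumptions, a 20-mer is error-free most of the time at a 1% error rate but only rarely at 15%, while dropping to $k = 10$ makes spurious matches in a genome-sized reference common.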
Instead, we can perform minhashing in this regime to efficiently find length-$L$ regions of the genome similar to a given $L$-mer.
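To illustrate the idea, here is a minimal MinHash sketch, not the implementation of any particular tool. It assumes each sequence is at least $k$ long, uses salted BLAKE2b from Python's `hashlib` as a stand-in for a family of random hash functions, and estimates the Jaccard similarity of two $k$-mer sets as the fraction of hash functions whose minima agree. The names `minhash_sketch` and `estimate_similarity` are our own.

```python
import hashlib


def kmer_set(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(seq, k, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over all k-mers of seq (requires len(seq) >= k)."""
    kmers = kmer_set(seq, k)
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")   # a different salt per hash function
        sketch.append(min(
            hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
            for kmer in kmers))
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree; this estimates
    the Jaccard similarity of the underlying k-mer sets."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


read = "ACGTTGCAAGGCTTACCGATAGG"      # toy sequences, for illustration only
noisy = "ACGTTGCAAGTCTTACCGATAGG"     # same sequence with one substitution
print(estimate_similarity(minhash_sketch(read, 4), minhash_sketch(noisy, 4)))
```

Two sequences with many shared $k$-mers agree on many minima even when no long exact match exists, which is what makes the sketch useful at high error rates.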
The seed-and-extend concept is based on this fingerprinting approach. We start by finding seed locations using $k$-mers, and use these to identify potentially matching regions in the genome. We then align using the dynamic program within these regions. Almost all practical tools use this approach.
In the low-error case, tools used in practice use short subsequences (of length 20-30) as potentially error-free fingerprints for looking up potential alignment locations. Given a read, tools like Bowtie find hash matches for each 20-mer in the read and from these compute a set of potential locations the read could match. They then compute alignments by dynamic programming at those locations. Some tools like Kallisto take an intersection of the sets of potential matches returned by the $k$-mers in a read (shifted appropriately) to obtain locations in the reference the read could have come from. (The actual algorithm is a little more subtle: the intersection is taken only between $k$-mers that have an entry in the hash table. The underlying assumption is that $k$-mers with 1-2 errors will not appear anywhere else in the reference.)
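The seed-and-extend pipeline can be sketched end to end on a toy example. This is a simplified illustration under our own assumptions, not the algorithm of Bowtie or Kallisto: a Python dict plays the role of the hash table, seeds vote for a shifted read start instead of being intersected, and the extension step is a dynamic program that lets the read start and end anywhere inside a small padded window. All function names and the `pad` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome, k):
    """Hash table mapping each k-mer to the list of its genome positions."""
    index = defaultdict(list)
    for p in range(len(genome) - k + 1):
        index[genome[p:p + k]].append(p)
    return index


def fit_distance(read, window):
    """Minimal edit distance between read and any substring of window
    (gaps before and after the read inside the window are free)."""
    prev = [0] * (len(window) + 1)      # empty read prefix fits anywhere
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cur[j] = min(prev[j] + 1,
                         cur[j - 1] + 1,
                         prev[j - 1] + (read[i - 1] != window[j - 1]))
        prev = cur
    return min(prev)


def seed_and_extend(genome, index, read, k, pad=2):
    """Seed: every k-mer hit votes for the implied read start.
    Extend: run the dynamic program only around the voted starts.
    Returns (voted start, edit distance), or None if no seed hits."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for p in index.get(read[offset:offset + k], ()):
            votes[p - offset] += 1
    best = None
    for start in votes:
        lo = max(0, start - pad)
        hi = min(len(genome), start + len(read) + pad)
        d = fit_distance(read, genome[lo:hi])
        if best is None or d < best[1]:
            best = (start, d)
    return best


genome = "ACGTTGCAAGGCTTACCGATAGGCTAACGTCCATG"   # toy genome
index = build_kmer_index(genome, 4)
read = "GCTTACTGATAG"     # genome[10:22] with one substitution
print(seed_and_extend(genome, index, read, 4))   # (10, 1)
```

The dynamic program touches only a window of about the read length per candidate, instead of the whole genome, which is the point of seeding.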
Read-overlap graph approaches typically require $O(N^2 L^2)$ operations, where $N \approx 10^8$ or $10^9$. We can use the fingerprinting idea to alleviate some of the cost. We can build a table where the keys are $k$-mers and the values are the reads containing a particular $k$-mer. We build this table by scanning through all the reads and applying a hash function. Now, for each read, we want to find a set of other reads that may align to the query read. Using this hash strategy gives us far fewer “candidate” reads per read, saving significant computation. The actual savings will depend on the number of repeats, but the cost reduces from $O(N^2)$ to $O(cN)$ for some constant $c$. This is done in practice by assemblers like DAligner, Minimap and MHAP.
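A toy version of this $k$-mer table for reads might look as follows. It is a sketch under our own assumptions rather than the method of any of the assemblers named above: the function name `candidate_overlaps` and the `min_shared` threshold are hypothetical, and the three reads are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k, min_shared=2):
    """Return {(a, b): number of shared distinct k-mers} for every pair of
    read indices a < b sharing at least min_shared k-mers."""
    table = defaultdict(set)            # k-mer -> ids of reads containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(rid)
    shared = defaultdict(int)
    for rids in table.values():         # only reads in the same bucket are paired
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


reads = ["ACGTTGCAAGGC",    # overlaps the next read on "CAAGGC"
         "CAAGGCTTACCG",
         "GCTAACGTCCAT"]    # shares only a single 4-mer with the first read
print(candidate_overlaps(reads, 4))   # {(0, 1): 3}
```

Only the surviving pairs are passed to the dynamic program. A $k$-mer that occurs in many reads (a repeat) produces a large bucket and many pairs, which is why the savings depend on the number of repeats.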