Assignment 3
posted Saturday 21 May 2016
due Wednesday 1 June 2016 at midnight
Submission policy: Report all plots and your code in this iPython notebook. Print your notebook as a PDF and attach it to the rest of your assignment. Turn in your assignment on the 2nd floor of Packard in the EE372 bin next to the kitchen area.
Question I: Minhashing
In class we discussed briefly how minhashing can be used for aligning reads with large error rates. In this question, we will explore the minhashing concept.

We can describe a read as a set of unique overlapping mers, and we would expect similar reads to have similar sets. Write a function that takes and a read as inputs and outputs a dictionary indicating which of the mers are in the read.

If we think of each read as a set of mers, a natural metric of similarity between two reads and is the Jaccard similarity, which is defined as \[ J(R_1,R_2) = \frac{R_1 \cap R_2}{R_1+R_2R_1 \cap R_2}. \] Explain how this metric captures similarity between two sets and how you might use this metric to align reads (23 sentences). Compute the minimum and maximum possible Jaccard similarity between any two sets.

Write a function to compute the Jaccard similarity between two dictionaries outputted by your function from part 1. Using this function and the one you wrote for part 1, compute the Jaccard similarity between the reads CATGGACCGACCAG and GCAGTACCGATCGT for . What is the runtime complexity of your function? If you have reads of length each, what is the worstcase runtime complexity for computing the Jaccard similarity between every pair of reads?

Suppose you have a function that hashes a mer to a value between and . For minhashing, you would use this hash function to map each unique mer in a read to an index, ultimately returning the smallest index. Prove that the probability that two sets will generate the same minhash index is equal to their Jaccard similarity.

In practice, we would use multiple hash functions to compute multiple minhash indices for and . Write down an estimator that uses the minhash indices to estimate . What is the runtime complexity of obtaining this estimation? How does this compare to the runtime you obtained for part 3?
Question II: Haplotype phasing coverage
In this problem we examine a simplified version of the haplotype assembly problem. We assume that a genome has SNPs, and each SNP is heterozygous. Every matepair read covers a pair of adjacent SNPs.

We assume that we know the exact position of the read on the genome by aligning the read to the genome. How many different types of reads do we obtain (restricting ourselves to their values at the SNP positions) if there were no errors?

Given that we obtain matepair reads, argue that for large and , the number of reads covering each adjacent pair of points .

If each position of each matepair read has an error rate of , argue that the number of erroneous pairwise measurements .

Given that there are reads covering a pair of consecutive positions, argue that the probability that a majority of the reads being wrong is approximately where represents the KL divergence between a distribution and a distribution. Hint: The probability of .

Using earlier parts or otherwise, compute an upper bound on the average probability of error in estimating the parity between SNR and , in terms of , , and .

Our final goal is to phase the entire genome correctly. Using part 6 or otherwise, compute a bound on the probability of error in phasing the entire genome.

For a given desired probability of phasing error , use the bound in part 7 to give an expression for , the number of matepair reads needed.

For , plot as a function of , the number of SNP’s. How does scale with as grows? Linearly, sublinearly or superlinearly? Can you give an intuitive explnation for your answer?

For , plot as a function of , the read error rate.
Question III: RNAseq Quantification
In class we discussed the basic EM algorithm for RNAseq quantification in the simple case when the transcript lengths are all the same and there are no errors in the read. In this question, we will consider extensions to unequal transcript lengths and read errors. We start with the same RNAseq model as discussed in class.

Implement the EM algorithm for the errorfree case where all transcripts have the same length. Fill in the code in the last cell of the ipython notebook.
 Instead of equal transcript length , let us now consider the case when the transcript lengths are . The reads are still errorfree.
 Develop the log likelihood model
 Derive the EM iterative algorithm for this model, specializing from the general EM discussed in the lecture.
 Now suppose the reads have errors: each base is read correctly with probability and incorrectly with probability , and if incorrect the base can be read equally likely as any of the other three possibilities.
 Generalize the log likelihood model in Part 1 to this case.
 Derive the EM iterative algorithm for this model, again specializing from the general EM algorithm.
 Suppose the alignment tool at your disposal can compute all exact alignments and all approximate alignments up to one base different. If the error rate is small such that the chance of a read having two or more errors is negligible, explain how you would use your alignment tool in implementing the EM algorithm derived in the previous part.