Data Science for High-Throughput Sequencing

This website accompanies the course EE 372: Data Science for High-Throughput Sequencing.
For questions, comments, or typos in the course notes, please leave a comment in the notes, submit a pull request directly to our Git repo, or email us. Click here for the previous offering's course website.

Announcements

19 March 2018: All assignment solutions now posted and accessible from the assignment pages.
14 March 2018: Project abstracts posted.
12 March 2018: Assignment 3 deadline extended to Friday 16 March 2018 at 11:59pm.
28 February 2018: Assignment 3 released. Due on Wednesday 14 March 2018 at 11:59pm. Submission through Gradescope.
6 February 2018: Assignment 2 released. Due on Tuesday 20 February 2018 at 11:59pm. Submission through Gradescope.
29 January 2018: Project guidelines handout posted. Please sign up for an office hour slot here.
17 January 2018: David's office hour on Thursday 18 January will be changed to Friday 19 January from 3:00-4:00pm. Govinda and Jesse will be holding office hours from 4:00-5:00pm Friday 19 January on the 3rd floor of Packard (kitchen area).
17 January 2018: Please fill out this Google doc for final project groups.
17 January 2018: Assignment 1 released. Due on Friday 26 January 2018 at 11:59pm. Submission through Gradescope (entry code: M5V4JJ).
9 January 2018: Course Description handout posted.
8 January 2018: Course Outline posted.

Course Description

Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. Many assays based on high-throughput sequencing have been designed to make biological measurements of interest. This course explores the computational and statistical problems that arise from processing high-throughput sequencing data. Specific problems we will study include genome assembly, haplotype phasing, RNA-Seq quantification, and single-cell RNA-Seq analysis. Specific techniques we will learn to solve these problems include spectral algorithms, dynamic programming, the EM algorithm, PCA, and FDR. Through this course, students will also become familiar with various software tools developed for the analysis of real sequencing data.

Course Staff

Instructor: David Tse (dntse _at_ stanford.edu)
Teaching assistants: Govinda Kamath (gkamath _at_ stanford.edu), Jesse Zhang (jessez _at_ stanford.edu)
Office hours: Mon 3:00-4:00pm and Thurs 3:15-4:15pm in Packard 264 (instructor); Mon 11:00am-12:00pm in Packard 104 (teaching assistants)

Lecture Times

Tuesday, Thursday 1:30-2:50pm at 540-108

Grading

Class participation: 10%
Scribing: 10%
Problem sets (3-4): 30%
Project: 50%

Useful References

Ben Langmead's lecture notes
Bioinformatics Algorithms by Compeau and Pevzner

Course Materials

Course Outline
Lecture 1: Introduction
Lecture 2: Basics of DNA & Sequencing by Synthesis
Lecture 3: Base Calling for Second-Generation Sequencing
Lecture 4: Base Calling for Next-Generation Sequencing
Lecture 5: Assembly - An Introduction
Lecture 6: Assembly - Greedy Algorithm
Lecture 7: Assembly - De Bruijn Graph
Lecture 8: Assembly - Towards a Long Reads Assembler
Lecture 9: Alignment - Dynamic Programming and Indexing
Lecture 10: Haplotype Phasing - Community Recovery
Lecture 11: Haplotype Phasing - Spectral Stitching
Lecture 12: RNA-seq - A Counting Problem
Lecture 13: RNA-seq - Quantification and the EM Algorithm Part 1
Lecture 14: RNA-seq - Quantification and the EM Algorithm Part 2
Lecture 15: Differential Analysis and Multiple Testing
Lecture 16: Single Cell RNA-Seq - Introduction
Lecture 17: Multiple Testing
Lecture 18: Empirical Bayes and Single-Cell RNA-Seq Analysis
Lecture 19: Single-Cell RNA-Seq - Clustering

Assignments

Assignment 1
Assignment 2
Assignment 3

Project

Project Guidelines and Ideas
Project Abstracts