This website accompanies the course EE 372: Data Science for High-Throughput Sequencing.
For questions/comments/typos in the course notes please leave a comment in the notes, submit a pull request directly to our Git repo, or email us. Click here for last offering's course website.
Announcements
- 19 March 2018: All assignment solutions now posted and accessible from the assignment pages.
- 14 March 2018: Project abstracts posted.
- 12 March 2018: Assignment 3 deadline extended to Friday 16 March 2018 at 11:59pm.
- 28 February 2018: Assignment 3 released. Due on Wednesday 14 March 2018 at 11:59pm. Submission through Gradescope.
- 6 February 2018: Assignment 2 released. Due on Tuesday 20 February 2018 at 11:59pm. Submission through Gradescope.
- 29 January 2018: Project guidelines handout posted. Please sign up for an office hour slot here.
- 17 January 2018: David's office hour on Thursday 18 January will be changed to Friday 19 January from 3:00-4:00pm. Govinda and Jesse will be holding office hours from 4:00-5:00pm Friday 19 January on the 3rd floor of Packard (kitchen area).
- 17 January 2018: Please fill out this Google doc for final project groups.
- 17 January 2018: Assignment 1 released. Due on Friday 26 January 2018 at 11:59pm. Submission through Gradescope (entry code: M5V4JJ).
- 9 January 2018: Course Description handout posted.
- 8 January 2018: Course Outline posted.
Course Description
Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. Many assays based on high-throughput sequencing have been designed to make various biological measurements of interest. This course explores the computational and statistical problems that arise from processing high-throughput sequencing data. Specific problems we will study include genome assembly, haplotype phasing, RNA-Seq quantification, and single-cell RNA-Seq analysis. Specific techniques we will learn to solve these problems include spectral algorithms, dynamic programming, the EM algorithm, PCA, and FDR. Through this course, students will also become familiar with various software tools developed for the analysis of real sequencing data.
Course Staff
Instructor: David Tse (dntse _at_ stanford.edu)
Teaching assistants: Govinda Kamath (gkamath _at_ stanford.edu), Jesse Zhang (jessez _at_ stanford.edu)
Office hours: Mon 3:00-4:00pm and Thurs 3:15-4:15pm at Packard 264 for instructor, Mon 11:00am-12:00pm at Packard 104 for teaching assistants
Lecture times
Tuesday, Thursday 1:30-2:50pm at 540-108
Grading
- Class participation: 10%
- Scribing: 10%
- Problem sets (3-4): 30%
- Project: 50%
Course Materials
Course Outline
Lecture 1: Introduction
Lecture 2: Basics of DNA & Sequencing by Synthesis
Lecture 3: Base Calling for Second-Generation Sequencing
Lecture 4: Base Calling for Next-Generation Sequencing
Lecture 5: Assembly - An Introduction
Lecture 6: Assembly - Greedy Algorithm
Lecture 7: Assembly - De Bruijn Graph
Lecture 8: Assembly - Towards a Long Reads Assembler
Lecture 9: Alignment - Dynamic Programming and Indexing
Lecture 10: Haplotype Phasing - Community Recovery
Lecture 11: Haplotype Phasing - Spectral Stitching
Lecture 12: RNA-seq - A Counting Problem
Lecture 13: RNA-seq - Quantification and the EM Algorithm Part 1
Lecture 14: RNA-seq - Quantification and the EM Algorithm Part 2
Lecture 15: Differential Analysis and Multiple Testing
Lecture 16: Single Cell RNA-Seq - Introduction
Lecture 17: Multiple Testing
Lecture 18: Empirical Bayes and Single-Cell RNA-Seq Analysis
Lecture 19: Single-Cell RNA-Seq - Clustering
Assignments