Data Science for High-Throughput Sequencing

This website accompanies the course Data Science for High-Throughput Sequencing (EE 372 at Stanford).
For questions, comments, or typos in the course notes, please leave a comment in the notes, submit a pull request directly to our Git repo, or email us at ee372-spr1516-staff _at_ lists.stanford.edu.

Announcements

2 June 2016: The poster session will take place from 3:30pm-5:30pm on June 6 in the Packard atrium. Pins and easels will be provided.
21 May 2016: Assignment 3 released. Due on 1 June 2016 at midnight. This will be the last assignment.
2 May 2016: Assignment 2 released. Due on 9 May 2016 at midnight.
21 April 2016: Project list and guidelines have been posted. Please access this Google Doc to sign up for a 10-minute slot with the TAs during the 27 April 2016 lecture.
8 April 2016: Tutorials on working in the shell and IPython are posted.
8 April 2016: Assignment 1 released. Due on 15 April 2016 at midnight.
30 March 2016: Additional scribing instructions posted under Course Logistics and Overview.
30 March 2016: Stephen Turner, Co-founder and CTO of PacBio, will be giving a guest lecture on 13 April 2016.
30 March 2016: Bikash Sabata, VP of software at Genia, will be giving a guest lecture on 6 April 2016.
29 March 2016: Please access this Google Doc to sign up for scribing a lecture.

Course Description

Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. Many high-throughput sequencing-based assays have been designed to make various biological measurements of interest. This course explores the computational and data science problems that arise from processing, managing, and performing predictive analytics on high-throughput sequencing data. Specific problems we will study include genome assembly, haplotype phasing, RNA-seq assembly, RNA-seq quantification, single-cell RNA-seq analysis, multi-omics analysis, and genome compression. We attack these problems through a combination of tools from information theory, combinatorial algorithms, machine learning, and signal processing. Through this course, students will also become familiar with various software tools developed for the analysis of real sequencing data.

Lecture times

Monday, Wednesday 3:00 PM - 4:20 PM at McCullough 115
Lab hour: Friday (exact time and location TBA)

Course Staff

Instructor: David Tse (dntse _at_ stanford.edu)
Teaching assistants: Govinda Kamath (gkamath _at_ stanford.edu), Jesse Zhang (jessez _at_ stanford.edu)
Office hours: 4:20pm-5:05pm MW at Packard 264 for instructor, 1:45pm-2:45pm M at Packard 260 for teaching assistants

Course Materials

Course Logistics and Overview
Course Outline
Lecture 1: Introduction
Lecture 2: Biological Background and Sequencing by Synthesis
Assignment 1
Lecture 3: Base Calling for Second-Generation Sequencing
Lecture 4: Nanopore Sequencing Technology (Guest Lecture)
Lecture 5: Assembly - An Introduction
Lecture 6: Pacific Biosciences Sequencing Technology (Guest Lecture)
Project Guidelines and Ideas
Lecture 7: Assembly - Necessary Conditions for Successful Assembly
Lecture 8: Assembly - The de Bruijn Graph Algorithm
Lecture 9: Assembly - Multibridging and Read-Overlap Graphs
Assignment 2
Lecture 10: Alignment - Introduction and Errors
Lecture 11: Alignment - Dynamic Programming and Indexing
Lecture 12: Haplotype Assembly - Introduction and Convolutional Codes
Lecture 13: Haplotype Assembly - Community Detection
Lecture 14: Wrapping up Haplotype Assembly and Introduction to RNA-seq
Lecture 15: RNA-seq - Quantification and the EM algorithm
Assignment 3
Lecture 16: RNA-seq - Hard EM and De Novo Transcriptome Assembly
Lecture 17: RNA-seq - De Novo Transcriptome Assembly and Single-Cell RNA-seq

Section Materials

Lab Hour 1