This website accompanies the course Data Science for High-Throughput Sequencing (EE 372 at Stanford).
For questions, comments, or typos in the course notes, please leave a comment in the notes, submit a pull request directly to our Git repo, or email us at ee372-spr1516-staff _at_ lists.stanford.edu.
Announcements
- 2 June 2016: The poster session will take place from 3:30pm-5:30pm on June 6 in the Packard atrium. Pins and easels will be provided.
- 21 May 2016: Assignment 3 released. Due on 1 June 2016 at midnight. This will be the last assignment.
- 2 May 2016: Assignment 2 released. Due on 9 May 2016 at midnight.
- 21 April 2016: Project list and guidelines have been posted. Please access this Google Doc to sign up for a 10-minute slot with the TAs during the 27 April 2016 lecture.
- 8 April 2016: Tutorials on working in the shell and IPython are posted.
- 8 April 2016: Assignment 1 released. Due on 15 April 2016 at midnight.
- 30 March 2016: Additional scribing instructions posted under Course Logistics and Overview.
- 30 March 2016: Stephen Turner, Co-founder and CTO of PacBio, will be giving a guest lecture on 13 April 2016.
- 30 March 2016: Bikash Sabata, VP of software at Genia, will be giving a guest lecture on 6 April 2016.
- 29 March 2016: Please access this Google Doc to sign up for scribing a lecture.
Course Description
Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. Many assays based on high-throughput sequencing have been designed to make various biological measurements of interest. This course explores the computational and data science problems that arise from processing, managing, and performing predictive analytics on high-throughput sequencing data. Specific problems we will study include genome assembly, haplotype phasing, RNA-Seq assembly, RNA-Seq quantification, single-cell RNA-Seq analysis, multi-omics analysis, and genome compression. We attack these problems through a combination of tools from information theory, combinatorial algorithms, machine learning, and signal processing. Through this course, students will also become familiar with various software tools developed for the analysis of real sequencing data.
Lecture times
Lectures: Monday, Wednesday 3:00 PM - 4:20 PM at McCullough 115
Lab hour: Friday (exact time and location TBA)
Course Staff
Instructor: David Tse (dntse _at_ stanford.edu)
Teaching assistants: Govinda Kamath (gkamath _at_ stanford.edu), Jesse Zhang (jessez _at_ stanford.edu)
Office hours: 4:20pm-5:05pm MW at Packard 264 for instructor, 1:45pm-2:45pm M at Packard 260 for teaching assistants
Course Materials
Course Logistics and Overview
Course Outline
Lecture 1: Introduction
Lecture 2: Biological Background and Sequencing by Synthesis
Assignment 1
Lecture 3: Base Calling for Second-Generation Sequencing
Lecture 4: Nanopore Sequencing Technology (Guest Lecture)
Lecture 5: Assembly - An Introduction
Lecture 6: Pacific Biosciences Sequencing Technology (Guest Lecture)
Project Guidelines and Ideas
Lecture 7: Assembly - Necessary Conditions for Successful Assembly
Lecture 8: Assembly - The de Bruijn Graph Algorithm
Lecture 9: Assembly - Multibridging and Read-Overlap Graphs
Assignment 2
Lecture 10: Alignment - Introduction and Errors
Lecture 11: Alignment - Dynamic Programming and Indexing
Lecture 12: Haplotype Assembly - Introduction and Convolutional Codes
Lecture 13: Haplotype Assembly - Community Detection
Lecture 14: Wrapping up Haplotype Assembly and Introduction to RNA-seq
Lecture 15: RNA-seq - Quantification and the EM algorithm
Assignment 3
Lecture 16: RNA-seq - Hard EM and De Novo Transcriptome Assembly
Lecture 17: RNA-seq - De Novo Transcriptome Assembly and Single-Cell RNA-seq
Section Materials