**Organisation:** CS course number: 16:198:674:01, CBMB course number: 16:118:617:03. Time: Thursdays 3:20-6:30pm. Room: Hill 264.

**NOTE:** This course is designed at the 500 level for first-year graduate students and advanced undergraduate students interested in modern biology applications or, more generally, in machine learning or statistical algorithms (e.g., Hidden Markov Models). No biology background is required. Although the class carries a 674 number, it may be used to satisfy the Category B requirement.

**Description:** The field of bioinformatics is primarily concerned with analyzing data from molecular biology using methods from computer science (algorithms and machine learning) and from computational statistics. Its development reflects the immense, ongoing change in biology and the rapid advances in experimental techniques, exemplified by the invention of DNA sequencing only 36 years ago, the completion of the human genome not quite a decade ago, and the prospect of personal genome sequences in the very near future. The biological questions we will address range from deciding whether two proteins have a common ancestor, and how to rapidly identify such proteins in large databases, to the assembly of genomes.

**Topics covered:**

- Sequence comparison: pair-wise sequence alignments
- Multiple sequence alignments
- Models for protein families: profile Hidden Markov Models
- Evolutionary models
- Phylogenetic Trees
- Signals in sequences: Gene regulation
- Gene prediction
- Sequence assembly
- Sequence comparisons for special cases: high-similarity matches using index structures
- Algorithms for next-generation sequencing: *-Seq

In this course we will introduce the necessary theory, the relevant algorithmic developments, and, through hands-on projects, the practical aspects of solving small bioinformatics problems. Emphasis is placed on recent developments in the field and on showing the interplay between algorithmic development and the statistical modeling driven by the biological question at hand. We will introduce, or revisit where needed, dynamic programming, shortest path algorithms, trees, string searching using index structures, multinomial distributions, Markov chains, Hidden Markov Models, the Maximum-Likelihood principle, Bayesian statistics, and Markov Chain Monte Carlo.

**Prerequisites:** Elementary algorithms, linear algebra, discrete math and probability theory. Students are expected to be proficient in a programming language at least to the point of implementing matrix multiplication or dynamic programming. A grade of C or better in "Analyzing Numbers in Biology" (16:118:617:02; 01:694:420; 01:750:487:01) is sufficient for fulfilling the prerequisites. For CS students: CS206, CS344 and a programming class will suffice. No biology background is required.
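
As a rough indication of the programming level meant above, the following sketch computes the edit distance between two strings by dynamic programming. It is an illustration only, not an official placement exercise; the function name `edit_distance` and the unit costs for insertion, deletion and substitution are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row.

    Assumption for this sketch: every insertion, deletion and
    substitution costs 1, and a match costs 0.
    """
    # prev[j] = distance between the first i-1 characters of a
    # and the first j characters of b (initially i-1 = 0).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into the empty string
        # takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Pairwise sequence alignment, the first topic listed for the course, follows the same table-filling pattern with biologically motivated scores in place of unit costs.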

**Grading:** The course will consist of lectures by the instructor, graded homework problems, student presentations, and class projects. Grades depend on active participation in lectures, graded homework, class projects, a written midterm, and a final project. Students may propose their own final projects, and projects may be done in groups.

**Textbook:** There will be no textbook; we will use individual chapters from appropriate texts, class notes and original literature.

**Course Website:** Further information about the course and course materials will be published on the Sakai website.