Next-generation sequencing experiments produce millions of short reads from target genomes in a cost-efficient manner. Higher throughput brings new challenges, such as how to map these short reads efficiently and how to deal with errors introduced by sequencing machines. Currently, we are investigating indel-tolerant read mapping, and we are looking for techniques that efficiently map reads back to a reference genome.
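To make "indel-tolerant" concrete, the following is a minimal sketch of approximate matching by semi-global edit distance: the read may start anywhere in the reference, and mismatches, insertions and deletions each cost one edit. The function name `map_read` and the unit costs are assumptions for illustration. This is a textbook dynamic program that runs in time proportional to read length times reference length, not the efficient mapping technique the project is investigating.

```python
def map_read(read, reference):
    """Return (end, distance): the end position (1-based, in the reference)
    and edit distance of the best approximate occurrence of `read`.

    Hypothetical helper for illustration; unit cost for every edit.
    """
    # prev[j] = fewest edits to align the read prefix processed so far
    # against some substring of the reference ending at position j.
    # The empty prefix matches anywhere at zero cost.
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(reference)
        for j in range(1, len(reference) + 1):
            mismatch = 0 if read[i - 1] == reference[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + mismatch,  # match or substitution
                prev[j] + 1,             # extra base in the read
                cur[j - 1] + 1,          # base missing from the read
            )
        prev = cur
    # Pick the leftmost end position with the smallest distance.
    best_end = min(range(len(prev)), key=prev.__getitem__)
    return best_end, prev[best_end]


if __name__ == "__main__":
    reference = "TTTTACGTACGTTTTT"
    print(map_read("ACGTACGT", reference))  # exact occurrence
    print(map_read("ACGACGT", reference))   # read with one base deleted
```

A practical mapper would typically use an index over the reference to find candidate locations first and apply a dynamic program like this one only to verify them.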
Genome assembly is one of the fundamental problems in bioinformatics. Assembly can be either reference-guided, when we have a reference genome that is similar to the genome we want to assemble, or de novo, when the genome is reconstructed only from the reads available from sequencing machines. With sequencing getting cheaper by the day, researchers are interested in assembling the genomes of more and more organisms. The main bottleneck here is the lack of reliable de novo assembly tools for next-generation sequencing data (the cheaper but shorter reads). We wish to investigate various aspects of the de novo assembly problem, such as read filtering and correction, contig building, and scaffolding.
Algorithm animations are a way of interactively exploring the dynamics of algorithms as they compute solutions. Graph algorithms in particular are natural candidates for such animations, and a variety of packages and programs already exist to animate the dynamics of solving problems from graph theory. Still, and somewhat surprisingly, it can be difficult to understand the ideas behind an algorithm from the dynamic display alone. We explore novel animations, technical solutions for integrating animations into programming exercises, and ways of integrating animations with coursework.
Hidden Markov Models (HMMs) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons, the parameters of an HMM are often estimated with maximum likelihood, and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood-based approaches are preferred in practice for their lower running times. We propose an approximate sampling technique, inspired by discrete sequence compression for HMMs and by kd-trees, that leverages spatial relations between data points in typical data sets to speed up the MCMC sampling.
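The likelihood-based baseline described above can be sketched as follows: a Viterbi segmentation of a sequence of log-ratio values under an HMM with Gaussian emissions. All parameters here (state means, standard deviations, a single `stay_prob` shared by all states, a uniform start distribution) are assumptions chosen for illustration and would normally be estimated from data. The sketch shows only the Viterbi step, not the proposed MCMC sampling technique.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_segment(obs, means, sds, stay_prob=0.9):
    """Most likely state path for `obs` (needs at least two states).

    Hypothetical model: uniform start distribution, probability
    `stay_prob` of staying in a state, and the remainder split
    evenly among the other states.
    """
    n = len(means)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (n - 1))

    def log_trans(p, k):
        return log_stay if p == k else log_switch

    # score[k] = log probability of the best path ending in state k.
    score = [-math.log(n) + gaussian_logpdf(obs[0], means[k], sds[k]) for k in range(n)]
    back = []  # back[t][k] = best predecessor of state k at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for k in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + log_trans(p, k))
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans(best_prev, k)
                + gaussian_logpdf(x, means[k], sds[k])
            )
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # State 0: no change (mean 0); state 1: gain (mean 1). Made-up values.
    log_ratios = [0.0, 0.1, -0.1, 1.0, 0.9, 1.1, 0.0, 0.05]
    print(viterbi_segment(log_ratios, means=[0.0, 1.0], sds=[0.2, 0.2]))
```

The result is a single path with no indication of how certain each segment boundary is, which is the limitation that motivates sampling-based approaches.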
In-situ hybridization experiments elucidate the spatial distribution of expressed mRNA in organisms. For Drosophila in particular, large amounts of data are available for several developmental stages, complementing the DNA-microarray gene expression experiments. We have developed an image processing pipeline and a framework for joint analysis, which allows the detection of co-located, co-expressed genes from fused data sets.
Tuberculosis (TB) is one of the most widespread diseases in the world, with about a third of the world's population infected. While most infections are asymptomatic, latent tuberculosis can progress into an acute and life-threatening condition. Most infections occur in third-world countries, where medical practice often remains below standard due to challenging circumstances and a high prevalence of AIDS leads to more active TB. The prolonged misuse of antibiotics has led to multiresistant strains, and several first-line and second-line antibiotics have been found to be ineffective. This project aims at modeling the macrophage infection mechanism using high-throughput experimentation and the development of novel algorithms for the associated computational challenges, in order to gain a systems-level understanding of the infection process, which might facilitate new hypotheses about potential drug targets.
We work in collaboration with Ralf Spörle from the Department of Developmental Genetics; Christian Hege, head of the Visualization Department at the Konrad-Zuse Zentrum (ZIB); and Bernd Fischer, Professor at the University of Lübeck, on the construction of an atlas of gene expression patterns in embryonic mice. The central piece is the construction of a non-linear registration that maps numerous in-situ tomograms onto an annotated standard model. This mapping then yields an automatic anatomical annotation of high-resolution 3D spatial expression patterns as well as the fusion of all patterns into one standard model. The mapped expression patterns can then be viewed and analyzed together within the standard model. Analysis of the data involves statistical group testing for functional territories.
Analyzing large genomic datasets requires enormous amounts of computational resources in terms of running time, cache usage, memory and disk space. We are working on finding alternative, reduced representations of these datasets that will enable downstream applications to work much more efficiently. We are also investigating the effect of limited cache on bioinformatics tools and looking for ways to overcome the difficulties it poses in k-mer counting and read mapping.
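For readers unfamiliar with k-mer counting, the following minimal sketch shows the straightforward hash-table approach. The function name `count_kmers` is an assumption for illustration. Each lookup in a baseline like this tends to touch an unpredictable memory location, which is one plausible reason limited cache becomes a problem at genome scale; the sketch does not represent the reduced representations under investigation.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring (k-mer) of `sequence`.

    Baseline sketch: one hash-table update per k-mer, with no attempt
    to improve memory locality or reduce the table size.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 3)
    for kmer, count in sorted(counts.items()):
        print(kmer, count)
```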
Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). The feature sets used by current tools for pre-miRNA recognition differ in construction and dimension. Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. Current tools achieve similar predictive performance even though the feature sets used, and their computational cost, differ widely. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and through feature importance in the induction of random forests.
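The text does not define the F-score. The sketch below assumes it refers to the Fisher-style score common in feature selection, which compares the separation of the two class means with the spread within each class; if a different definition is intended, the formula changes accordingly. The function name `f_score` and the example values are made up for illustration.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Fisher-style F-score of one feature (assumed definition).

    `pos` and `neg` hold the feature's values for positive examples
    (e.g. true pre-miRNAs) and negative examples. Larger scores mean
    the feature separates the classes better. Each class needs at
    least two values, and the score is undefined when both classes
    have zero variance.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within


if __name__ == "__main__":
    # Hypothetical feature values for two candidate features.
    print(f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # well separated
    print(f_score([1.0, 5.0, 3.0], [2.0, 4.0, 6.0]))  # heavily overlapping
```

A score like this rates each feature on its own, so it cannot capture features that are only informative in combination. Importance measures derived from random forests take such interactions into account.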
The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. The hematopoietic system is particularly well studied in this respect. Gene expression data from cells at various distinguishable developmental stages help elucidate the underlying molecular processes, which change gradually over time and lock cells into certain lineages. We developed a statistical framework for tasks ranging from visualization, querying, and finding clusters of similar genes to answering detailed questions about the functional roles of individual genes and their similarities and differences.