Department of Computer Science
 Rutgers University

Completed Projects

GeneFinder: Ab initio gene finding

Ab initio gene finding, or gene prediction, remains an interesting problem. Current gene finders work well, but predictions of eukaryotic genes in particular leave much room for improvement. Our gene finder, based on Anders Krogh's HMMGene, is used to test ideas for improving the predictions. One idea is to incorporate external data such as gene expression to aid the prediction, which is otherwise driven by signals and sequence composition. The gene finder is implemented in Python using the Python bindings of the General Hidden Markov Model library (GHMM), our own free (LGPL) HMM library.

ComplexDiseases: Analysis of complex disease data

Finding the genetic causes of complex diseases such as autism and ADHD is complicated by ambiguity and subjectivity in the diagnostic process and by the simultaneous involvement of multiple genes and environmental factors. We investigate the application of mixture-model-based clustering to fused geno- and phenotype data. This joint analysis might yield further insight into the complex interactions between geno- and phenotypes which underlie a specific disease pattern.

GenExpTimecourses: Analysis of gene expression time-courses

The molecular processes of life are dynamic over time. Microarray experiments measuring the expression levels of a multitude of genes over time are one way of gaining insight into these dynamic processes. As a first analysis, groups of genes with similar expression patterns are routinely identified. We have developed an approach which can make use of prior knowledge, is flexible, and is very robust to noise. The method is implemented in the software GQL, which allows control of the analysis process through graphical user interfaces. Currently, we are extending our framework to allow integration of further data related to transcription or protein interactions. Furthermore, we are investigating methodologies for validating clusterings of genes with functional annotation.

ArrayCGH: Analyzing comparative genomic hybridization data

This project concerns detecting chromosomal aberrations from ArrayCGH and gene expression experimental data. Chromosomal aberrations such as deletions or duplications of chromosomal regions are a crucial contributing factor to cancer. The aberrations can be detected by observing the relative hybridization intensities of healthy vs. diseased patients for BAC clones covering complete genomes. A Hidden Markov Model with an inhomogeneous Markov chain makes it possible to reflect dependencies between overlapping clones.
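The idea of an HMM whose transitions vary along the genome can be illustrated with a minimal sketch. This is not the project's implementation: the three states, the Gaussian emission parameters, and the rule that maps clone overlap to a stay probability are all assumptions chosen for illustration.

```python
import math

STATES = ("loss", "normal", "gain")
# Assumed mean log2 intensity ratio per state and a shared standard deviation.
MEANS = {"loss": -0.5, "normal": 0.0, "gain": 0.5}
SIGMA = 0.2


def log_emission(state, x):
    """Log density of a Gaussian emission for one log2 ratio."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur, overlap):
    """Position-dependent transition: more overlap, more likely to stay.

    `overlap` is the fraction (0..1) shared by two neighboring clones.
    The linear mapping to a stay probability is a made-up illustration.
    """
    stay = 0.8 + 0.19 * overlap
    p = stay if prev == cur else (1.0 - stay) / (len(STATES) - 1)
    return math.log(p)


def viterbi(ratios, overlaps):
    """Most likely state path; overlaps[i] links clone i and clone i + 1."""
    score = {s: math.log(1.0 / len(STATES)) + log_emission(s, ratios[0])
             for s in STATES}
    back = []
    for x, ov in zip(ratios[1:], overlaps):
        new_score, pointers = {}, {}
        for cur in STATES:
            best = max(STATES,
                       key=lambda p: score[p] + log_transition(p, cur, ov))
            pointers[cur] = best
            new_score[cur] = (score[best] + log_transition(best, cur, ov)
                              + log_emission(cur, x))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi(ratios, overlaps)` with one log2 ratio per clone and one overlap fraction per neighboring clone pair returns one state label per clone. Because the transition depends on the overlap at each position, the underlying Markov chain is inhomogeneous.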

CSIMixtures: Context-specific independence mixture modeling for sequence motifs

The modeling and analysis of sequence motifs is a central task in the elucidation of biological processes such as gene regulation. The choice of model class is crucial to obtain a representation of the motif suitable for the biological application. For instance, previous studies showed that for transcription factors which bind to divergent binding sites, mixtures of multiple PWMs increase performance. However, estimating a conventional mixture distribution for each position will in many cases cause overfitting. We avoid this problem by employing a context-specific independence (CSI) framework. In CSI mixtures, model complexity is automatically adapted to match the variability found in a given data set.
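The following sketch illustrates why sharing parameters reduces model complexity in a mixture of PWMs. It assumes that a CSI structure can be represented by letting several components reference the same column distribution at a position. The motif width, probabilities, and weights are invented for illustration, and the structure is fixed by hand here rather than learned from data.

```python
# Per-position distributions over A, C, G, T (illustrative numbers).
MOSTLY_A = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
MOSTLY_C = {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05}
MOSTLY_G = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}
MOSTLY_T = {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85}

# Two components of width 3. At positions 0 and 2 both components share one
# distribution; only position 1 is component-specific.
COMPONENTS = [
    [MOSTLY_A, MOSTLY_G, MOSTLY_T],
    [MOSTLY_A, MOSTLY_C, MOSTLY_T],
]
WEIGHTS = [0.5, 0.5]


def component_likelihood(site, component):
    """Probability of a site under one PWM (positions are independent)."""
    p = 1.0
    for base, column in zip(site, component):
        p *= column[base]
    return p


def posteriors(site, components=COMPONENTS, weights=WEIGHTS):
    """Posterior probability of each component given the site."""
    joint = [w * component_likelihood(site, c)
             for w, c in zip(weights, components)]
    total = sum(joint)
    return [j / total for j in joint]


def free_parameters(components):
    """Count free column parameters, counting shared columns only once."""
    total = 0
    for pos in range(len(components[0])):
        distinct = {id(comp[pos]) for comp in components}
        total += 3 * len(distinct)  # 4 probabilities summing to 1: 3 free
    return total
```

With the shared structure above, `free_parameters(COMPONENTS)` counts four distinct columns, whereas a conventional mixture with a separate distribution per component and position would need six.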

ProteinComplexes: Delineation of protein complexes in yeast

The delineation of protein complexes from protein-protein interaction data is not as trivial as it may seem. We developed a simple probabilistic framework to cluster purifications while preserving the partial order relation among purifications. With a simple graph-based approach motivated by the asymmetric relationship between purifications, we can visualize overlapping components of protein complexes as supported by the experiment.

Tiling: Design of Tiling Arrays

Genomic tiling arrays are universal arrays in the sense that they cover complete genomes or chromosomes uniformly, in contrast to most other types of DNA microarrays, for which specific sites of interest such as genes or splice sites are defined a priori. We define the problem of choosing optimal oligonucleotide probes from large candidate sets and provide efficient algorithms for solving it, which run in linear time in most instances.

MicroarrayDetection: Detecting biological agents with DNA Microarrays

DNA microarrays, well known for measuring gene expression levels, can be used for detecting the presence or absence of biological targets (viruses or bacteria) from hybridization patterns of oligonucleotide probes and genomic DNA of agents. Due to the sequence similarity of possible targets, the use of non-unique oligonucleotides becomes necessary. With the use of statistical group testing and phylogenetic information about targets, even the detection of novel targets becomes viable.
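To illustrate how non-unique probes can still identify targets, here is a minimal sketch of noise-free combinatorial decoding. It is a simplification, not the statistical group testing used in the project: it assumes no false positive or false negative hybridizations and ignores phylogenetic information. The probe and target names are hypothetical.

```python
from itertools import combinations

# Hypothetical design: each probe hybridizes to several targets.
PROBE_TARGETS = {
    "p1": {"virusA", "virusB"},
    "p2": {"virusB", "virusC"},
    "p3": {"virusA", "virusC"},
    "p4": {"virusC", "virusD"},
}


def decode(positive_probes, probe_targets=PROBE_TARGETS):
    """Return the smallest target sets that explain the observed pattern.

    A target is ruled out if any probe it hybridizes to is negative.
    Among the remaining targets, every positive probe must be explained
    by at least one present target.
    """
    all_targets = set().union(*probe_targets.values())
    excluded = set()
    for probe, targets in probe_targets.items():
        if probe not in positive_probes:
            excluded |= targets
    candidates = sorted(all_targets - excluded)
    # Try increasingly large target sets and stop at the first size that works.
    for size in range(len(candidates) + 1):
        hits = [set(combo) for combo in combinations(candidates, size)
                if all(probe_targets[p] & set(combo) for p in positive_probes)]
        if hits:
            return hits
    return []
```

For example, `decode({"p1", "p3"})` leaves a single explanation, while a pattern in which all four probes are positive has more than one minimal explanation. Ambiguity of this kind is one reason a statistical treatment is needed in practice.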

HomologyClassification: Detecting remote homologs as a classification problem

Detecting whether two proteins are homologs is one of the fundamental problems in bioinformatics. Classically, their sequence similarity is measured with a sequence alignment score and a decision about homology is made using score statistics. How well one can solve this classification problem is strongly influenced by the assumptions necessary for the statistics to hold. We use an approach based on Support Vector Machines to address this problem.
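The classical alignment score mentioned above can be sketched as a Smith-Waterman local alignment score. This shows only the score computation, not the score statistics or the SVM-based approach. The match, mismatch, and gap values are assumptions for illustration; real protein comparisons typically use a substitution matrix and affine gap penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best
```

A score of this kind could serve as the single feature for a threshold decision, or as one input among several to a classifier.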

RemoteHomologues: Identifying clusters of remote homologues

Detecting proteins which share a common ancestor is an important step in understanding protein structure and function. Multi-domain proteins normally cause problems due to the spurious similarities they induce; with a simple graph-based approach based on the concept of asymmetric similarity, we were able to clearly outperform PSI-BLAST.
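One way to make a similarity asymmetric is to scale a pairwise alignment score by the self-score of the source sequence. The sketch below assumes this definition and a fixed threshold; neither is confirmed as the project's method. The protein names and scores are invented.

```python
def directed_edges(scores, self_scores, threshold=0.5):
    """Build a directed graph with an edge a -> b when
    score(a, b) / score(a, a) >= threshold.

    `scores` maps unordered pairs (a, b) to a symmetric alignment score;
    `self_scores` maps each protein to its score against itself.
    """
    edges = {name: set() for name in self_scores}
    for (a, b), s in scores.items():
        for src, dst in ((a, b), (b, a)):
            if s / self_scores[src] >= threshold:
                edges[src].add(dst)
    return edges


def mutual_pairs(edges):
    """Pairs connected in both directions."""
    return {frozenset((a, b))
            for a in edges for b in edges[a] if a in edges[b]}


# Invented example: "long" is a multi-domain protein containing the domain
# that makes up all of "short" and "short2".
SELF_SCORES = {"short": 100, "short2": 110, "long": 300}
SCORES = {("short", "long"): 90, ("short", "short2"): 80,
          ("short2", "long"): 85}
EDGES = directed_edges(SCORES, SELF_SCORES)
```

In this example the single-domain proteins point to the multi-domain protein, but not the other way round, so requiring edges in both directions keeps the multi-domain protein from merging unrelated groups.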

MASCAAT: Meta-Learning for Selection and Combination of Clustering Algorithms Applied to Gene Expression Analysis

Whether to cluster at all, which clustering method to use, and how many clusters to choose are pressing questions in bioinformatics. Mostly, decisions are made by users of clustering software based on experience, guided by benchmarking or by indicators for the reliability of solutions or model fit. However, as clustering algorithms always produce solutions, inappropriate methods or parameters are often used and invalid results are produced. Meta-learning refers to the application of machine learning techniques to choosing methods and guiding the setting of parameters. We intend to build a computational framework to perform cluster validation and apply meta-learning to the problem of analyzing gene expression time-courses. More information at the Project Page. Joint work funded by CAPES (Brazil) and DAAD (Germany) under the program Probral.