Department of Computer Science
 Rutgers University

On the feasibility of Heterogeneous Analysis of Large Scale Biological Data

I.G. Costa and A. Schliep

In Proceedings of ECML/PKDD 2006 Workshop on Data and Text Mining for Integrative Biology, 55–60, 2006.

Secondary information such as Gene Ontology (GO) annotations or location analysis of transcription factor binding is often relied upon to assess the validity of clusters, by checking whether individual terms or factors are significantly enriched in clusters. If such enrichment indeed supports validity, it should also help in finding biologically meaningful clusters in the first place. One simple framework that allows this, and that does not rely on strong assumptions about the data, is semi-supervised learning. A primary data source, gene expression time-courses, is clustered, and GO annotation or transcription factor binding information, the secondary data, is used to define pairwise constraints on pairs of genes for the computation of clusters. We show that this approach improves performance when high-quality labels are present, but that naive use of the heterogeneous data routinely used for cluster validation will actually decrease clustering performance.
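To make the idea of pairwise constraints concrete, the following is a minimal sketch and not the authors' method. It assumes that two genes receive a "must-link" constraint whenever they share at least one annotation term, and it only checks a given clustering against those constraints rather than performing constrained clustering itself. The function names, gene names, and toy annotations are hypothetical.

```python
from itertools import combinations


def must_link_pairs(annotations):
    """Return gene pairs that share at least one annotation term.

    `annotations` maps a gene name to a set of term identifiers.
    Sharing a term as the constraint rule is an assumption of this sketch.
    """
    pairs = set()
    for a, b in combinations(sorted(annotations), 2):
        if annotations[a] & annotations[b]:
            pairs.add((a, b))
    return pairs


def violated(pairs, cluster_of):
    """Return the must-link pairs whose genes fall in different clusters."""
    return {(a, b) for a, b in pairs if cluster_of[a] != cluster_of[b]}


# Hypothetical toy data: four genes with a few annotation terms each.
annotations = {
    "geneA": {"GO:0006412"},
    "geneB": {"GO:0006412", "GO:0006950"},
    "geneC": {"GO:0006950"},
    "geneD": set(),
}

# Hypothetical clustering of the primary data (for example, expression profiles).
cluster_of = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 1}

pairs = must_link_pairs(annotations)
broken = violated(pairs, cluster_of)
print(sorted(pairs))
print(sorted(broken))
```

In a semi-supervised setting, a count of violated constraints like this one would typically enter the clustering objective as a penalty. Noisy or overly general annotations produce many spurious pairs, which is one way naive use of secondary data can hurt rather than help.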

Presented on Sept. 18, 2006 by Ivan Costa at ECML Workshop on Data and Text Mining for Integrative Biology (Contributed Talk).