Analyzing large genomic datasets requires an enormous amount of computational resources: running time, cache, memory, and disk space. We are working on finding alternative, reduced representations of these datasets that enable downstream applications to run much more efficiently.
We are also investigating the effect of limited cache on bioinformatics tools and looking for ways to overcome the difficulties it poses for the k-mer counting and read mapping problems. Cache efficiency can play a very important role in designing computationally feasible tools for handling large datasets. With Turtle, we demonstrated that asymptotically more expensive algorithms can outperform asymptotically cheaper ones by being cache efficient.
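The trade-off can be made concrete with a small sketch. It is not Turtle's implementation; the function names and the `min_count` threshold are assumptions made for illustration. It counts k-mers in two ways: with a hash table, which takes expected linear time but scatters its memory accesses, and by sorting, which costs O(n log n) but finishes with a single sequential scan over runs of equal k-mers. Python itself will not reproduce the cache behavior, so the sketch only shows the structure of the two approaches.

```python
from collections import Counter
from itertools import groupby


def kmers(seq, k):
    """Yield every length-k substring of seq, in order of position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_by_hashing(seq, k):
    """Expected O(n) updates; each update lands in an effectively random table slot."""
    return dict(Counter(kmers(seq, k)))


def count_by_sorting(seq, k):
    """O(n log n) sort, then one sequential pass that measures each run of equal k-mers."""
    ordered = sorted(kmers(seq, k))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(ordered)}


def frequent_kmers(counts, min_count=2):
    """Keep only k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer: c for kmer, c in counts.items() if c >= min_count}
```

Both counting functions return the same dictionary for the same input; they differ only in how they access memory, which is where a cache-aware implementation in a lower-level language would gain or lose time.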
Roy, Rajat S. and Bhattacharya, Debashish and Schliep, Alexander. Turtle: Identifying frequent k-mers with cache-efficient algorithms (2014) [details]
Mahmud, Md. Reduced representations for efficient analysis of genomic data; from microarray to high throughput sequencing (2014) [details]
Mahmud, Md and Schliep, Alexander. TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping (2014) [details]
Mahmud, Md and Wiedenhoeft, John and Schliep, Alexander. Indel-tolerant Read Mapping with Trinucleotide Frequencies using Cache-Oblivious kd-Trees (2012) [details]
Mahmud, Md and Schliep, Alexander. Speeding Up Bayesian HMM by the Four Russians Method (2011) [details]
Mahmud, Md and Schliep, Alexander. Fast MCMC Sampling for Hidden Markov Models to Determine Copy Number Variations (2011) [details]