The Lab

Projects

Automated Biocuration Pipeline for RDoC Mental Health Categories

The RDoC mental health categories, recently introduced by NIH, are a state-of-the-art research framework for studying mental disorders. The framework provides a holistic way of describing mental disorders by integrating several levels of information, from genomics to self-reports. However, only a limited amount of publicly available biomedical literature (e.g., PubMed articles) is tagged with RDoC categories, which limits the framework's accessibility among medical personnel. Manual annotation of biomedical literature with RDoC constructs is also highly resource-intensive. We are therefore collaborating with National Alliance on Mental Illness (NAMI) Montana through MSU’s Center for Mental Health Research and Recovery (CMHRR) to develop an automated tool that predicts RDoC categories for biomedical articles. To this end, we are developing novel machine learning and natural language processing methods for categorizing biomedical literature using RDoC mental health categories.

This is a collaboration with Matt Kuntz (National Alliance on Mental Illness - Montana) and Dr. Neha John-Henderson (Department of Psychology).


Automated Protein Phenotype Prediction

The recently developed Human Phenotype Ontology (HPO), which is very similar to the Gene Ontology (GO), is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. It is therefore important to predict gene-HPO term associations using accurate computational methods. In this project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.

Automated Protein Function Prediction

Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything that it performs or that happens to it. The Gene Ontology (GO) is a structured vocabulary that captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years, scientists have been able to annotate a large number of proteins with GO categories that reflect their functionality. However, experimentally determining protein functions is a highly resource-intensive task, and a large fraction of proteins remain unannotated. Recently, a plethora of automated methods have emerged that determine protein functions computationally from a variety of data sources, such as sequence or structure similarity and various biological network data. Their reasonable success has established automated function prediction (AFP) as an important problem in bioinformatics.

In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. However, because annotations accumulate over time, we view AFP as a combination of two sub-tasks: predicting novel annotations for previously annotated proteins and predicting annotations for previously unannotated proteins. In this project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method previously developed in our lab, and two other popular methods, binary SVMs and guilt by association, do not reach the accuracy on these two tasks that cross-validation suggests, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We develop GOstruct 2.0 by proposing improvements that allow the model to use a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements.
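
The two sub-tasks can be illustrated with a small sketch. It assumes two hypothetical annotation snapshots taken at different times, each mapping a protein to its set of terms; the function name, protein names, and term labels are placeholders for illustration and are not taken from GOstruct.

```python
def split_temporal_benchmarks(old, new):
    """Split proteins that gained annotations between two snapshots.

    old, new: dict mapping protein name -> set of annotation terms
    (hypothetical data layout, chosen for illustration).
    Returns two dicts mapping protein -> newly gained terms:
      1. proteins with no annotations in the old snapshot
      2. proteins already annotated in the old snapshot
    """
    previously_unannotated = {}
    novel_for_annotated = {}
    for protein, new_terms in new.items():
        old_terms = old.get(protein, set())
        gained = new_terms - old_terms
        if not gained:
            continue  # nothing new to predict for this protein
        if old_terms:
            novel_for_annotated[protein] = gained
        else:
            previously_unannotated[protein] = gained
    return previously_unannotated, novel_for_annotated


# Placeholder snapshots (not real proteins or GO terms).
OLD = {"P1": {"term_A"}, "P2": {"term_B"}}
NEW = {"P1": {"term_A", "term_C"}, "P2": {"term_B"}, "P3": {"term_D"}}

unannotated, annotated = split_temporal_benchmarks(OLD, NEW)
print(unannotated)
print(annotated)
```

In this toy data, P3 falls into the first sub-task and P1 into the second, while P2 gained nothing and is left out of both. A cross-validation split would not distinguish these cases.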

Quality Assurance of Computational Functional Genomics Tools

Automated Function Prediction (AFP) tools will play a very important role in medicine and health care in the future. However, current tools predict different sets of Gene Ontology (GO) terms for the same input, and only a few of the predicted terms match the experimentally validated terms. Moreover, the experimentally validated terms are themselves assumed to be incomplete. Consequently, biologists find it difficult to select a tool, and developers find it difficult to test one.

Metamorphic Testing (MT) is a technique used to test programs for which the correct output is unknown or practically difficult to determine. It checks whether the program behaves according to an expected set of properties called metamorphic relations (MRs). An MR specifies how a particular change to the input of the program should change the output. An MR is violated when the observed change in output differs from the change the relation specifies. In this project, we are exploring the feasibility of using Metamorphic Testing for testing AFP tools.
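
As a minimal sketch of the idea, the example below checks one possible MR against a toy predictor. Both are assumptions made for illustration: the predictor (nearest-neighbor term transfer by k-mer similarity) is not one of the AFP tools studied in the project, and the MR (shuffling the reference set must not change the predicted terms) is only one plausible relation. Sequences and term labels are placeholders.

```python
import random


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def predict_terms(query, references):
    """Toy predictor: transfer the terms of the most similar reference(s)."""
    scores = [(similarity(query, seq), terms) for seq, terms in references]
    best = max(score for score, _ in scores)
    predicted = set()
    for score, terms in scores:
        if score == best:  # ties contribute all of their terms
            predicted |= set(terms)
    return predicted


def check_permutation_mr(query, references, trials=20, seed=0):
    """MR: reordering the reference set must leave the prediction unchanged."""
    rng = random.Random(seed)
    source_output = predict_terms(query, references)
    for _ in range(trials):
        shuffled = references[:]
        rng.shuffle(shuffled)
        if predict_terms(query, shuffled) != source_output:
            return False  # MR violated
    return True


# Placeholder reference set (not real proteins or GO terms).
REFERENCES = [
    ("MKTAYIAKQR", ["term_A"]),
    ("MKTAYLLKQR", ["term_B"]),
    ("GGGSSSPPPA", ["term_C"]),
]

print(check_permutation_mr("MKTAYIAKQR", REFERENCES))
```

The check never needs to know the correct terms for the query. It only compares the program's outputs on related inputs, which is what makes MT usable when the expected output is unknown.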

This is a collaboration with Dr. Upulee Kanewala (School of Computing) and Dr. Diane Bimczok (Microbiology and Immunology).


Pioneering New Approaches to Explore Pangenomic Space at Scale

This project develops new software tools for pangenomic analysis, which studies genomic DNA sequences from multiple organisms to understand how organisms adapt their genomes to their environments. It is now routine for multiple genomes per species to be sequenced, giving much more information about the species. Our approach makes use of a graph-based representation of a pangenome and exploits this representation to efficiently find both shared and unique regions of interest.
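
To illustrate what a graph-based representation makes possible, here is a minimal sketch that is not the project's software. It assumes a simple k-mer graph in which each node is a k-mer labelled with the genomes that contain it. The function names, the value of k, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_graph(genomes, k=4):
    """Build a toy k-mer graph from named sequences.

    Returns (node_genomes, edges):
      node_genomes: k-mer -> set of genome names containing it
      edges: k-mer -> set of k-mers that directly follow it in some genome
    """
    node_genomes = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        previous = None
        for i in range(len(seq) - k + 1):
            node = seq[i:i + k]
            node_genomes[node].add(name)
            if previous is not None:
                edges[previous].add(node)
            previous = node
    return node_genomes, edges


def shared_and_unique(node_genomes, genome_names):
    """Nodes present in every genome, and nodes present in exactly one."""
    all_names = set(genome_names)
    shared = {n for n, g in node_genomes.items() if g == all_names}
    unique = {n for n, g in node_genomes.items() if len(g) == 1}
    return shared, unique


# Toy sequences (placeholders, not real genomes).
GENOMES = {"g1": "ACGTAC", "g2": "ACGTTT"}

nodes, edges = build_kmer_graph(GENOMES)
shared, unique = shared_and_unique(nodes, GENOMES)
print(sorted(shared))
print(sorted(unique))
```

Because every node carries the set of genomes it occurs in, shared and unique regions can be read off the graph without pairwise comparison of the genomes. Real pangenome graphs are far larger and typically use compacted representations.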

This is a collaboration with Dr. Brendan Mumey (School of Computing) and the National Center for Genome Resources (NCGR).


microRNA Prediction in Plants

microRNAs (miRNAs) are small non-coding RNAs that act as post-transcriptional regulators of gene expression. Experimental identification of miRNAs is expensive and time-consuming, and computational prediction methods provide an alternative. Homology-based identification methods find miRNAs by comparing sequences to known miRNAs. They can find conserved miRNAs in related species, but they cannot identify divergent sequences. We are developing ab initio methods that use machine learning to identify novel miRNAs that are not homologous to known ones.

This is a collaboration with Dr. Hikmet Budak (MSU Cereal Genomics Lab).