Toward an Automated RNA-seq Bioinformatician


Measurement of gene expression — which genes are active in which conditions — is an indispensable tool for understanding biological systems. Analysis of gene expression from modern genomic sequencing technologies, such as RNA-seq, requires the use of sophisticated software such as read mappers, transcript assemblers, and expression abundance estimators. A software program implementing one of these steps typically has a large number of user-settable parameters that influence how the analysis algorithm performs. Scientists, biologists, and clinical researchers must often tune these parameters by hand or through other ad hoc means. The goal of this project is to automate this process by designing and implementing a framework for automatically learning high-performing parameters for gene expression analysis software. This project also aims to develop algorithms, software, and methodology to make this framework practical and useful. This will allow more researchers to obtain high-quality gene expression analyses with significantly less effort and will also enable improved analysis of large data sets where per-sample parameter tuning by hand is impractical. Reproducibility of biological results will also be enhanced since the choice of parameters is explicitly ceded to an automated, repeatable process. This research will make biological studies involving gene expression more accurate and less costly. A number of educational and outreach activities for various levels of students (elementary through undergraduate) are also planned to enhance community understanding of gene expression and its analysis.

The processes we develop will be implemented in several wrapper tools for parameter optimization that can be dropped into existing RNA-seq analysis pipelines to improve accuracy at each step. The research to design these tools will be broken down into several more tractable steps. The first step is to learn, for each tool, a collection of representative parameter vectors by analyzing large collections of existing RNA-seq samples. In the second step, machine learning methods that combine techniques such as Bayesian optimization, genetic algorithms, and classification will be used to design procedures that select, from these sets, the parameter vectors predicted to offer high performance. The design of this system will also enhance our practical knowledge of parameter optimization techniques for other application domains within biology.
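The two steps above can be illustrated with a minimal sketch. It is not the project's implementation, and every name in it (`greedy_representative_set`, `advise`, `run_tool`, `estimate_quality`, `min_cov`) is hypothetical. For the first step it assumes a table of quality scores already measured on training samples and picks a small representative set greedily. For the second step it swaps the predictive machine learning selection described above for the simplest alternative: run the tool once per candidate and keep the output that a quality estimator scores highest.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]  # hypothetical: one parameter vector, keyed by parameter name


def greedy_representative_set(scores: Sequence[Sequence[float]], k: int) -> List[int]:
    """Step 1 (sketch): choose up to k parameter vectors that cover the samples well.

    scores[s][p] is an assumed, precomputed quality of parameter vector p on
    training sample s. Each round adds the vector that most increases the sum,
    over samples, of the best score achieved by any chosen vector.
    """
    n_params = len(scores[0]) if scores else 0
    chosen: List[int] = []
    best_so_far = [float("-inf")] * len(scores)
    for _ in range(min(k, n_params)):
        best_total, best_p = None, None
        for p in range(n_params):
            if p in chosen:
                continue
            total = sum(max(b, row[p]) for b, row in zip(best_so_far, scores))
            if best_total is None or total > best_total:
                best_total, best_p = total, p
        chosen.append(best_p)
        best_so_far = [max(b, row[best_p]) for b, row in zip(best_so_far, scores)]
    return chosen


def advise(
    candidates: Sequence[Params],
    run_tool: Callable[[Params], Any],
    estimate_quality: Callable[[Any], float],
) -> Tuple[Params, Any]:
    """Step 2 (sketch): run the tool with each candidate and keep the best output.

    run_tool and estimate_quality are placeholders supplied by the caller; a
    predictive selector would avoid running the tool on every candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate parameter vector is required")
    best_params, best_output, best_score = None, None, float("-inf")
    for params in candidates:
        output = run_tool(params)
        score = estimate_quality(output)
        if best_params is None or score > best_score:
            best_params, best_output, best_score = params, output, score
    return best_params, best_output
```

In a pipeline, `run_tool` would wrap a call to a single analysis program, such as a transcript assembler, and `estimate_quality` would score its output without needing ground truth. Trying every candidate costs one tool run per parameter vector, which is why keeping the representative set small matters.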

Carl Kingsford
Herbert A. Simon Professor of Computer Science



(2019). More accurate transcript assembly via parameter advising. The 2019 ICML Workshop on Computational Biology; journal version in Journal of Computational Biology.
