Yihang Shen successfully defends his dissertation

April 10, 2024: Yihang Shen has successfully defended his Ph.D. Congratulations, Dr. Shen!

Yihang Shen (2024) Automated hyper-parameter tuning and its applications in computational biology. Ph.D. Thesis (CMU-CB-24-100).

Hyper-parameters play a crucial role in the efficacy of machine learning and computational biology tools. Their optimal selection profoundly impacts tool performance, yet manual tuning of these hyper-parameters can be a laborious process, demanding extensive domain knowledge. Therefore, the development of algorithms for automatic hyper-parameter tuning is important. While numerous strategies for hyper-parameter tuning in machine learning exist, certain challenges remain inadequately addressed. These include tuning hyper-parameters (1) in contexts where the search space exhibits unique characteristics, such as high dimensionality; and (2) for computational biology tools, where optimal settings are closely tied to the specific biological sample being analyzed. In response to these challenges, this dissertation focuses on the development of novel algorithms designed for the automatic tuning of hyper-parameters across various tool types.

In the first part, we focus on developing new Bayesian Optimization methods for hyper-parameter tuning in high-dimensional search spaces. Bayesian Optimization is a machine learning method that is widely used in various scenarios, including the tuning of hyper-parameters. Yet, its application in high-dimensional search spaces presents a significant challenge that remains to be fully addressed. To overcome this, we develop a new high-dimensional Bayesian Optimization framework based on the concept of variable selection and show that the new method is more computationally efficient than previous high-dimensional Bayesian Optimization methods.

In the second part, we focus on developing hyper-parameter tuning algorithms for transcript assemblers. Transcript assemblers are tools for reconstructing expressed transcripts from the reads in a given RNA-seq sample. Given that these tools have many tunable hyper-parameters, and their optimal configurations greatly depend on the characteristics of the input sample, it is crucial to develop automatic tuning methods that adapt to different inputs. We develop the first adaptive, sample-specific hyper-parameter tuning system for transcript assemblers. This innovation marks an important advancement towards more precise transcript assembly, which in turn will enhance downstream RNA-seq analyses such as transcript quantification.

In high-throughput sequencing biological data analysis, the initial step is to align reads to a linear reference genome to determine their genomic locations. Recognizing that genetic variations differ among biological samples, it is crucial to use a sample-specific reference genome rather than a default one. Therefore, automatically deducing the sample-specific reference genome directly from the sample data becomes an important problem. It is a unique hyper-parameter tuning problem, where the reference genome represents the hyper-parameter and the search space encompasses various potential genomes. In the last part, we focus on developing algorithms to infer genomes from Hi-C, a distinct type of high-throughput sequencing data providing insights into the spatial arrangement of chromosomes. We show that using an inferred genome improves downstream Hi-C analyses, thereby contributing to a more profound understanding of chromosomal organization and function.