August 26, 2024 — Haotian Teng has successfully defended his Ph.D. dissertation. Teng was co-advised by Carl Kingsford and Ziv Bar-Joseph. Congratulations Dr. Teng!
Many fundamental biological tasks require unsupervised learning because ground-truth labels are unavailable, but shallow unsupervised machine learning methods perform poorly on these tasks due to the complexity of the problems. With their strong representation power, deep learning models have been widely applied to challenging tasks; however, they usually require large amounts of labeled data.
To take advantage of the strong representation power of deep learning in unsupervised tasks, we developed several hybrid models that combine deep neural networks with unsupervised machine learning models. We used these models to improve performance on unsupervised biological tasks, including cell type clustering, basecalling, and molecular docking, demonstrating how a hybrid model can be used to solve spatial tasks (cell type clustering), temporal tasks (basecalling), and spatiotemporal tasks (molecular docking).
First, we present an unsupervised cell type clustering model for recently developed single-molecule, spatially resolved transcriptomics data. A deep neural network (NN) encoder generates low-dimensional, Gaussian-distributed gene embeddings, which are then combined with spatial relationships using a Gaussian-Multinomial Mixture Model that we developed to predict cell type clusters.
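As a rough illustration of how a mixture model can combine an embedding with spatial context, the sketch below computes posterior cluster probabilities from a diagonal Gaussian term over each cell's embedding and a multinomial term over the cell types counted in its neighborhood. Encoding spatial relationships as neighbor-type counts, the function name `responsibilities`, and all parameter names are assumptions made for this example; it is not the dissertation's implementation.

```python
import numpy as np


def responsibilities(z, counts, log_pi, mu, var, log_p):
    """Posterior cluster probabilities for each cell (hypothetical sketch).

    z:      (N, D) low-dimensional embeddings, one row per cell
    counts: (N, K) number of neighbors of each type around each cell
            (an assumed encoding of the spatial relationships)
    log_pi: (K,)   log mixing weights
    mu:     (K, D) Gaussian means per cluster
    var:    (K, D) diagonal Gaussian variances per cluster
    log_p:  (K, K) log_p[k, m] = log probability that a neighbor of a
            type-k cell has type m
    """
    # Diagonal Gaussian log-density of every embedding under every cluster.
    diff = z[:, None, :] - mu[None, :, :]                      # (N, K, D)
    log_gauss = -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var)).sum(axis=-1)

    # Multinomial log-likelihood of the neighbor counts. The multinomial
    # coefficient is the same for every cluster, so it cancels on normalizing.
    log_mult = counts @ log_p.T                                # (N, K)

    log_joint = log_pi + log_gauss + log_mult
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stability
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)
```

In an EM-style fit, these probabilities would serve as the E-step, with the means, variances, and neighbor-type probabilities re-estimated from them in the M-step.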
The second problem we tackle is calling m6A-methylated bases in RNA reads generated by long-read sequencing. m6A modification plays essential roles in regulating gene expression, but an efficient way to detect it systematically is lacking. Long-read sequencing from Oxford Nanopore Technologies has been shown to be sensitive to post-transcriptional modifications, but an m6A-sensitive basecaller that directly detects this subtle sequencing signal has not yet been developed. We used a CNN-RNN (convolutional-recurrent neural network) model that we previously developed for canonical basecalling to train a non-homogeneous HMM (NHMM) whose transition matrix is conditioned on the deep NN output. Using hybrid synthetic m6A methylation data sampled from the NHMM, we were able to train an NN basecaller to call m6A bases. We applied our method to call the methylome in yeast and human RNA without the need for knock-out comparison data.
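To make the idea of a non-homogeneous HMM concrete, the following minimal sketch runs the standard forward algorithm with a different transition matrix at every time step. Here the per-step matrices are produced from arbitrary logits by a row-wise softmax, standing in for a neural network output; the function names, array shapes, and the softmax parameterization are assumptions for illustration and are not taken from the dissertation.

```python
import numpy as np


def transitions_from_logits(logits):
    """Row-wise log-softmax: (T-1, K, K) logits -> log transition matrices.

    In a hybrid model the logits would be emitted by a neural network;
    here they are just an input array (assumption for this sketch).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def nhmm_forward_loglik(log_init, log_trans, log_emit):
    """Log-likelihood of one sequence under a non-homogeneous HMM.

    log_init:  (K,)        log initial state probabilities
    log_trans: (T-1, K, K) log transition matrix for each step
    log_emit:  (T, K)      log-likelihood of each observation under each state
    """
    T = log_emit.shape[0]
    alpha = log_init + log_emit[0]
    for t in range(1, T):
        # alpha_t(j) = logsumexp_i[alpha_{t-1}(i) + log A_{t-1}(i, j)] + log b_j(x_t)
        scores = alpha[:, None] + log_trans[t - 1]
        alpha = np.logaddexp.reduce(scores, axis=0) + log_emit[t]
    return np.logaddexp.reduce(alpha)
```

The only change from a homogeneous HMM is that `log_trans` is indexed by time, which is what lets the transition probabilities depend on an external signal at each step.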
For the third application, we developed a deep generative model with an SE(3)-equivariant diffusion transformer to address pocket-based molecular docking, in which the 3D structure of the ligand is predicted given the protein pocket. We applied our model to a virtual screening task to select effective JAK2 inhibitors. From a total of 9,137 drugs, we identified 13 candidate compounds with high affinity scores confirmed by wet lab assays, two of which are new molecules that have not previously been reported.
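SE(3) equivariance means that rotating and translating the input coordinates rotates and translates the output in the same way. The sketch below checks this property numerically for a toy coordinate update that moves each point along pairwise difference vectors, weighted by a function of the (rotation-invariant) distances. This toy layer is only an illustration of the property; it is not the diffusion transformer described above, and `equivariant_update`, `random_rotation`, and the `weight` parameter are hypothetical names.

```python
import numpy as np


def equivariant_update(x, weight=0.1):
    """Toy SE(3)-equivariant update of 3D points x with shape (N, 3)."""
    diff = x[:, None, :] - x[None, :, :]              # pairwise differences (N, N, 3)
    dist2 = (diff ** 2).sum(axis=-1, keepdims=True)   # invariant squared distances
    gate = weight * np.exp(-dist2)                    # invariant scalar per pair
    return x + (diff * gate).sum(axis=1)


def random_rotation(rng):
    """Random proper rotation matrix (determinant +1) via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    return q


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
R, t = random_rotation(rng), rng.normal(size=3)

# Transforming the input first or the output afterwards gives the same result.
moved_then_updated = equivariant_update(x @ R.T + t)
updated_then_moved = equivariant_update(x) @ R.T + t
print(np.allclose(moved_then_updated, updated_then_moved))
```

For docking, this property means a predicted ligand pose does not depend on the arbitrary coordinate frame in which the protein pocket happens to be stored.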