Laura Tung successfully defends her Ph.D. thesis

October 28, 2022: Laura Tung successfully defended her Ph.D. thesis. She will join Guardant Health, Inc. Congratulations, Dr. Tung!

Laura H. Tung (2022) Algorithms and computational methods for transcriptome analysis. Ph.D. (CMU-CB-22-108).

Studying the transcriptome is crucial to understanding functional elements of the genome and elucidating biological pathways associated with disease. High-throughput sequencing such as RNA-seq has become a powerful tool for transcriptome analysis. Due to limited read lengths, identifying full-length transcripts from short reads remains challenging. As third-generation sequencing becomes increasingly important, single-molecule long reads have been used to improve transcriptome analyses such as isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis. This drives a need for transcript assembly on long reads.

We developed a reference-based long-read transcript assembler, Scallop-LR, aiming to discover more novel isoforms. Analyzing a considerable number of RNA-seq long-read samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate that Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read transcript assembly by Scallop-LR can reveal a more complete human transcriptome.

Nanopore sequencing has become a leading choice for long-read RNA-seq. However, Nanopore long reads have high error rates. For many non-model organisms without a high-quality reference, de novo (reference-free) error correction methods designed for RNA-seq long reads are needed. We developed a novel, error-profile-aware correction method, deepCorrRNA, for correcting RNA-seq long reads de novo using deep learning. deepCorrRNA combines a graph-based method and a deep neural network that systematically incorporates error-profile-related information. We show that the ML-based deepCorrRNA achieves error-rate reductions comparable to those of the state-of-the-art, ONT-specific isONcorrect. Across different organisms, deepCorrRNA demonstrates robust de novo error correction capability, which can benefit transcriptome studies of non-model organisms. In principle, deepCorrRNA's method is generalizable and may be applied to different sequencing technologies.

To accelerate transcriptome analyses, RNA-seq analysis tools require comprehensive evaluation and parameter optimization. While the number of RNA-seq samples in large sequence databases grows enormously, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This creates a need to select, from the RNA-seq samples in a large database, a representative subset that effectively summarizes the original collection. We developed a novel hierarchical representative set selection method to tackle the memory and runtime challenges of k-mer counting approaches for RNA-seq samples in a large database. We demonstrate that hierarchical representative set selection achieves summarization quality close to that of direct representative set selection while greatly reducing runtime and memory usage, and that it substantially outperforms random sampling on the entire SRA set of human RNA-seq samples.
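
The general idea of selecting representatives hierarchically can be illustrated with a small sketch. This is not the method from the thesis: the k-mer profile, the cosine distance, the greedy farthest-point selection, and all function names below are assumptions chosen only to show why working chunk by chunk avoids comparing every sample against every other sample at once.

```python
import math
from collections import Counter


def kmer_profile(seq, k=3):
    """Normalized k-mer frequencies of one sequence (toy stand-in for a sample)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse k-mer profiles."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def select_representatives(ids, profiles, n):
    """Greedy farthest-point selection of up to n samples (an assumed heuristic)."""
    if len(ids) <= n:
        return list(ids)
    chosen = [ids[0]]
    # Distance from every sample to its nearest chosen representative.
    dist = {i: cosine_distance(profiles[i], profiles[chosen[0]]) for i in ids}
    while len(chosen) < n:
        remaining = [i for i in ids if i not in chosen]
        nxt = max(remaining, key=dist.get)
        chosen.append(nxt)
        for i in ids:
            dist[i] = min(dist[i], cosine_distance(profiles[i], profiles[nxt]))
    return chosen


def hierarchical_select(ids, profiles, n, chunk_size):
    """Select within each chunk first, then select again among the chunk winners."""
    candidates = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        candidates.extend(select_representatives(chunk, profiles, n))
    return select_representatives(candidates, profiles, n)


# Toy usage with made-up sequences standing in for RNA-seq samples.
samples = {
    "a": "ACGTACGTACGT", "b": "ACGTACGTACGA",
    "c": "TTTTTTTTTTTT", "d": "TTTTTTTTTTTA",
    "e": "GGGGCCCCGGGG", "f": "GGGGCCCCGGGC",
}
profiles = {name: kmer_profile(seq) for name, seq in samples.items()}
reps = hierarchical_select(list(samples), profiles, n=2, chunk_size=3)
print(reps)
```

In this sketch, distances are only ever computed within one chunk or among the chunk-level candidates, so the largest comparison set is bounded by the chunk size rather than by the size of the whole collection.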

The algorithms, methods, and analysis we have developed can be used to improve transcriptome analyses and further our understanding of complex transcriptomes.

PDF