October 12, 2020 — Cong Ma has successfully defended her Ph.D. Congratulations Dr. Ma! Cong will join Princeton University as a postdoc.
Anomalies are data points that do not follow established or expected patterns. When measuring gene expression, anomalies in RNA-seq are observations or patterns that cannot be explained by the inferred transcript sequences or expressions. Transcript sequences and expression are key indicators for cell status and are used in many phenotypic and disease analyses.
Identifying such unexplainable RNA-seq patterns can inspire improvements in the accuracy of the transcript sequences and expression inferred from RNA-seq data, and can benefit the analyses based on transcripts. We develop computational methods to identify the RNA-seq anomalies that violate inferred sequence variation and expression patterns, and to improve the reconstructed transcripts such that they can explain the anomalies.
The first type of anomaly that we detect is large-scale sequence variation in the transcriptome, or transcriptomic structural variants (TSVs). TSVs are usually induced by genomic structural variants, which can fuse sequences from a pair of genes or involve intergenic regions. Previous TSV detection methods assume that TSVs only fuse a pair of genes and do not consider that some genes are still unknown, so many RNA-seq reads from intergenic or intronic regions cannot be explained by gene fusions. We develop a computational method, SQUID, to identify fusions both between a pair of genes and involving non-transcribing regions, thus enlarging the set of explained variants and RNA-seq reads. We further extend SQUID's formulation to the MULTIPLE COMPATIBLE ARRANGEMENTS PROBLEM, which allows TSVs to be detected in the context of allele heterogeneity.
The second type of anomaly that we identify is the coverage anomaly in estimated expression. The number of RNA-seq reads at each position along each transcript follows a distribution determined by the RNA-seq experiment protocol. We develop a method, Salmon Anomaly Detection (SAD), to identify the transcripts whose coverage distribution cannot be explained by the RNA-seq protocol. We observe that both quantification algorithm mistakes and incomplete reference transcripts cause abnormal coverage patterns. We also develop an adjustment procedure to correct quantification algorithm mistakes indicated by coverage anomalies and improve the accuracy of estimated expression.
Our analysis of the coverage anomalies shows that some of the coverage anomalies are indicators of the regulation efficiency of transcription factors and can explain a part of the variability of the target gene expression. The developed methods introduce novel dimensions to more completely explain RNA-seq data, and can be incorporated into RNA-seq analyses to better characterize phenotype-transcript relationships or used to evaluate future transcript reconstruction methods.