May 5, 2023 — Yutong Qiu, a CPCB Ph.D. student, successfully defended her dissertation titled “Algorithmic Foundations of Genome Graph Construction and Comparison.” She will join Illumina, Inc. Congratulations, Dr. Qiu!
Pangenomic studies have enabled a more accurate depiction of the human genome landscape. Genome graphs are suitable data structures for analyzing collections of genomes due to their efficiency and flexibility of encoding shared and unique substrings from the population of encoded genomes.
Novel challenges arise when genome graphs are applied to thousands of genomes because current genome graph models are insufficient in addressing the questions: (1) How can genome graphs be constructed efficiently that optimize the storage space? (2) How can genome graphs be used to more accurately and more efficiently compare heterogeneous sequences such as cancer genomes or immune repertoires? To answer these questions, we lay algorithmic foundations for genome graph construction and comparison. The size of a genome graph is crucial to both efficient storage and analysis. How- ever, few genome graph construction methods directly optimize the graph size. By drawing connections to data compression, we develop an algorithmic framework for genome graph construction that prioritizes genome graph size and show that the new framework produces small genome graphs efficiently compared to other genome graph schemes. Our compression-based framework not only removes the depen- dency on hyper-parameters but also opens up the potential for adapting established compression algorithms to construct better genome graphs.
In many scenarios, such as immune repertoire analysis, we need to quantify the similarity between heterogeneous sets of genomic strings, but the complete strings are unknown due to limitations in sequencing technology. The distance between genome graphs can be used to estimate to the difference between these strings. One important metric is defined as the graph traversal edit distance (GTED). We revisit the complexity of and the previously proposed algorithms for GTED. We prove that GTED is NP-complete and show that the previously proposed algorithms computes a lower bound of GTED. In addition, we propose two correct ILP formulations of GTED and characterize the relationship between GTED and the previous lower bound ILPs. We evaluate the empirical efficiency of solving GTED and its lower bound ILP and show that solving GTED exactly with ILPs is currently not practical on larger genomes. Genome graphs are often highly expressive and represent more than one string sets, and thus the distance between two graphs using standard graph distances does not always model the actual edit distance between true string sets. To quantify this discrepancy, we formally define genome graph expressiveness as its diameter and use it to bound the deviation of the genome graph distance from string set distances. We produce a more accurate distance measure between (unseen) collections of strings encoded as genome graphs. The new distance measure and its deviation from string set distances are evaluated on simulated human T-cell repertoire sequences and Hepatitis B virus genomes.