Kingsford Group - Papers

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1--58 .

The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms" has been accepted to the Journal of the ACM.

Continue reading...

Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology?. Cell Systems 15(6): P483-487.

The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm.

Continue reading...

Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol..

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly.

Continue reading...

Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. Bioinformatics.

Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms.

Continue reading...

Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv.

Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because of the heterogeneity of different RNA-seq samples, a single default setting or a small fixed set of parameter candidates can only support the good performance of transcript assembly on average, but are often suboptimal for many individual samples.

Continue reading...

Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024.

We tackle the problem of meta-learning across heterogenous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity.

Continue reading...

C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford. (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024.

The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors.

Continue reading...

Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. Genome Research 34:1987-1999.

Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps.

Continue reading...

Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024.

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes.

Continue reading...

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology.

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment.

Continue reading...

Hossein Asghari, Ehsan Haghshenas, Roby Thomas, Eric Schultz, Rob Patro, Stan Skrzypczak, and Carl Kingsford (2023) Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC. Cancer Res (2023) 83 (7_Supplement): 1400.

View source

AC Chiang, H Asghari, K Ashley, S Gettinger, S Goldberg, R Herbst, FH Wilson, BR Newton, MK Cohenuram, KD Sabbath, AS Talsania, AV Russo, E Schultz, S Skrzypczak, C Kingsford, and KS Schalper (2023) Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer. J Clin Oncol 41, 2023 (suppl 16; abstr 8597).

View source

Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford (2023) Reinforcement Learning for Robotic Liquid Handler Planning. In Proceedings of WABI 2023.

Robotic liquid handlers play a crucial role in automating laboratory tasks such as sample preparation, high-throughput screening, and assay development. Manually designing protocols takes significant effort, and can result in inefficient protocols and involve human error. We investigate the application of reinforcement learning to automate the protocol design process resulting in reduced human labor and further automation in liquid handling.

Continue reading...

Minh Hoang, Guillaume Marçais, and Carl Kingsford (2024) Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. Journal of Computational Biology 31(1):2-20.

Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to be simultaneously achieved.

Continue reading...

Yihang Shen and Carl Kingsford (2023) Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection. In Proceedings of AutoML 2023.

Bayesian Optimization (BO) is a method for globally optimizing black-box func- tions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming.

Continue reading...

Hongyu Zheng, Guillaume Marçais, and Carl Kingsford (2023) Creating and Using Minimizer Sketches in Computational Genomics. J Comp. Biol. (2023).

Processing large datasets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more.

Continue reading...

Yutong Qiu, Yihang Shen, and Carl Kingsford (2023) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. In Proceedings of WABI 2023.

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error- prone process of genome assembly.

Continue reading...

May 5, 2023 — Yutong Qiu, a CPCB Ph.D. student, successfully defended her dissertation titled “Algorithmic Foundations of Genome Graph Construction and Comparison.” She will join Illumina, Inc. Congratulations, Dr. Qiu!

Yutong Qiu (2023) Algorithmic Foundations of Genome Graph Construction and Comparison. Ph.D. Thesis CMU-CB-23-101.

Pangenomic studies have enabled a more accurate depiction of the human genome landscape. Genome graphs are suitable data structures for analyzing collections of genomes due to their efficiency and flexibility of encoding shared and unique substrings from the population of encoded genomes.

Continue reading...

October 28, 2022 — Laura Tung successfully defended her Ph.D. thesis. She will join Gaurdant Health, Inc. Congratulations, Dr. Tung!

Laura H. Tung (2022) Algorithms and computational methods for transcriptome analysis . Ph.D. (CMU-CB-22-108).

Continue reading...

Yutong Qiu and Carl Kingsford (2022) The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. In Proceedings of ISMB 2022 in Bioinformatics 38(Supplement_1):i404-412 (2022)..

Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets.

Continue reading...

February 24, 2022 — Hongyu Zheng successfully defended his Ph.D. thesis. He will joint Princeton University as a postdoc. Congradulations, Dr. Zheng!

Hongyu Zheng (2022) Theory and Practice of Low-Density Minimizer Sketches. Ph.D. Dissertation.

Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches.

Continue reading...

Minh Hoang, Hongyu Zheng, and Carl Kingsford (2022) DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes. In Proceedings of RECOMB 2022..

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite its widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question.

Continue reading...

Minh Hoang and Carl Kingsford (2021) Personalized neural architecture search for federated learning.. In NeurIPS 2021 Workshop on New Frontiers in Federated Learning 2021..

Federated Learning (FL) is a recently proposed learning paradigm for decentralized devices to collaboratively train a predictive model without exchanging private data. Existing FL frameworks, however, assume a one-size-fit-all model architecture to be collectively trained by local devices, which is determined prior to observing their data.

Continue reading...

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2019) How much data is sufficient to learn high-performing algorithms?. STOC 2021 (preprint arXiv:1908.02894 [cs.LG] (2019)).

Algorithms for scientific analysis typically have tunable parameters that significantly influence computational efficiency and solution quality. If a parameter setting leads to strong algorithmic performance on average over a set of typical problem instances, that parameter setting---ideally---will perform well in the future.

Continue reading...

Toshiki Masuishi, Hiroya Taniguchi, Daisuke Kotani, Hideaki Bando, Taroh Satoh, Taito Esaki, Yoshito Komatsu, Yu Sunakawa, Tomohiro Nishina, Eiji Shinozaki, Naohiro Nishida, Masato Komoda, Satoshi Yuki, Naoki Izawa, Gaurav Sharma, Stan Skrzypczak, Eric Schultz, Carl Kingsford, Akihiro Sato, and Takayuki Yoshino (2021) Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701). Journal of Clinical Oncology 39(15 suppl): e15532 (2021).

View source

Yutong Qiu and Carl Kingsford (2021) Constructing small genome graphs via string compression. ISMB 2021.

The size of a genome graph -- the space required to store the nodes, their labels and edges -- affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs.

Continue reading...

Laura H. Tung and Carl Kingsford (2021) Practical selection of representative sets of RNA-seq samples using a hierarchical approach. ISMB 2021.

Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2021) Sequence-specific minimizers via polar sets. ISMB 2021.

Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size.

Continue reading...

Avi Srivastava, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Carl Kingsford, and Rob Patro (2021) Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism. bioRxiv, doi: https://doi.org/10.1101/2021.01.17.426996.

Transcript and gene quantification is the first step in many RNA-seq analyses. While many factors and properties of experimental RNA-seq data likely contribute to differences in accuracy between various approaches to quantification, it has been demonstrated (1) that quantification accuracy generally benefits from considering, during alignment, potential genomic origins for sequenced fragments that reside outside of the annotated transcriptome.

Continue reading...

Natalie Sauerwald and Carl Kingsford (2021) Capturing the complexity of topologically associating domains through multi-feature optimization. bioRxiv.

The three-dimensional structure of human chromosomes is tied to gene regulation and replication timing, but there is still a lack of consensus on the computational and biological definitions for chromosomal substructures such as topologically associating domains (TADs). TADs are described and identified by various computational properties leading to different TAD sets with varying compatibility with biological properties such as boundary occupancy of structural proteins.

Continue reading...

October 12, 2020 — Cong Ma has successfully defended her Ph.D. Congratulations Dr. Ma! Cong will join Princeton University as a postdoc.

Cong Ma (2020) Detecting anomalies in inferred transcript sequences and expression from RNA-seq. Ph.D. Thesis.

Anomalies are data points that do not follow established or expected patterns. When measuring gene expression, anomalies in RNA-seq are observations or pat- terns that cannot be explained by the inferred transcript sequences or expressions. Transcript sequences and expression are key indicators for cell status and are used in many phenotypic and disease analyses.

Continue reading...

Yihang Shen and Carl Kingsford (2020) Probabilistic method corrects previously uncharacterized Hi-C artifact. bioRxiv.

Three-dimensional chromosomal structure plays an important role in gene regulation. Chromosome conformation capture techniques, especially the high-throughput, sequencing-based technique Hi-C, provide new insights on spatial architectures of chromosomes. However, Hi-C data contains artifacts and systemic biases that substantially influence subsequent analysis.

Continue reading...

Natalie Sauerwald (2020) Algorithms for the study of chromosomal structure variability . Ph.D. Dissertation.

October 5, 2020 — The last two decades have introduced several experimental methods for studying three-dimensional chromosome structure, opening up a new dimension of genomics. Studies of these new data types have shown great promise in explaining some of the open questions in gene regulation, but the experiments are indirect and imperfect measurements of the underlying structure, requiring rigorous computational methods.

Continue reading...

Cong Ma, Hongyu Zheng, and Carl Kingsford (2020) Exact Transcript Quantification Over Splice Graphs. In Proceedings of WABI 2020; journal version in Alg. Mol. Bio. 2021.

The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript.

Continue reading...

J Lee, ST Kim, KM Kim, E Schultz, S Skrzypczak, R Patro, and C Kingsford (2020) Novel target discovery in pembrolizumab-resistant gastric cancer using a comprehensive RNA-seq analysis pipeline. Journal of Clinical Oncology 38(15_suppl), e16541-e16541 (2020).

View source

Minh Hoang and Carl Kingsford (2020) Optimizing Dynamic Structures with Bayesian Generative Search. ICML 2020.

Kernel selection for kernel-based model is a difficult problem. Exact solutions are, in general, prohibitively expensive to solve for due to the NP-hard nature of these combinatorial tasks. In addition, gradient-based optimization techniques are inapplicable due to the non-differentiability of the objective function. As such, many state-of-the-art solutions for this problem resorts to heuristic search and gradient-free optimization.

Continue reading...

Trevor Frisby, Shawn James Baker, Guillaume Marçais, Quang Minh Hoang, Carl Kingsford, and Christopher James Langmead (2020) Harvestman: A framework for hierarchical feature learning and selection from whole genome sequencing data.. BMC Bioinformatics 22:174 (2021).

We present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, the largest publicly available collection of whole genome sequences.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Improved design and analysis of practical minimizers. Proceedings of ISMB 2020 (Bioinformatics).

Minimizers are methods to sample k-mers from a sequence, with the guarantee that similar set of k-mers will be chosen on similar sequences. It is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of softwares and pipelines to improve computation efficiency and decrease memory usage.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Lower density selection schemes via small universal hitting sets with short remaining path length. Proceedings of RECOMB 2020 (preprint: arXiv:2001.06550 [cs.DS]).

Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size.

Continue reading...

Prashant Pandey, Yinjie Gao, and Carl Kingsford (2019) VariantStore: A Large-Scale Genomic Variant Search Index. Genome Biology *22*: 231 (2021).

The ability to efficiently query genomic variation data from thousands of samples is critical to achieve the full potential of many medical and scientific applications such as personalized medicine. We present VariantStore, a system for efficiently indexing and searching millions of genomic variants across thousands of samples.

Continue reading...

Cong Ma and Carl Kingsford (2019) Estimating mutual information under measurement error. bioRxiv 852384.

Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases.

Continue reading...

Natalie Sauerwald and Carl Kingsford (2019) A statistical method for identifying consistently important features across samples. BioRxiv.

In many applications, features with consistently high measurements across many samples are particularly meaningful and useful for quality control or biological interpretation. Identification of these features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values.

Continue reading...

Cong Ma and Carl Kingsford (2019) Detecting, categorizing, and correcting coverage anomalies of RNA-seq quantification. Cell Systems *9*:1-11.

Due to incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA-seq sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation.

Continue reading...

HuBMAP Consortium (2019) The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature *574*:187-192 (2019).

Transformative technologies are enabling the construction of three-dimensional maps of tissues with unprecedented spatial and molecular resolution.

Continue reading...

Scott A. Keith, Rory Eutsey, Heewook Lee, Brad Solomon, Stacie Oliver, Carl Kingsford, N. Luisa Hiller, and Brooke M. McCartney (2019) Identification of Microbiota-Induced Gene Expression Changes in the Drosophila melanogaster Head. bioRxiv.

Symbiotic microorganisms exert multifaceted impacts on the physiology of their animal hosts. Recent discoveries have shown the gut microbiota influence host brain function and behavior, but the host and microbial molecular factors required to actuate these effects are largely unknown.

Continue reading...

Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford (2019) Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 18:1--18:19 (2019); journal version in Alg Mol Bio (2020).

Transcriptomic structural variants (TSVs) --- structural variants that affect expressed regions --- are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs.

Continue reading...

Avi Srivastava, Laraib Malik, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro (2019) Alignment and mapping methodology influence transcript abundance estimation. Genome Biology 21: 239..

The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. We investigate the effect of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential gene expression analysis.

Continue reading...

Dan DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marcais (2019) Practical universal k-mer sets for minimizer schemes. To appear in ACM-BCB 2019.

Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same.

Continue reading...

Hongyi Xin, Mingfu Shao, and Carl Kingsford (2019) Context-Aware Seeds for Read Mapping. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 15:1--15:13 (2019) (journal version in Alg Mol Bio).

Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers.

Continue reading...

Laura H. Tung, Mingfu Shao, and Carl Kingsford (2019) Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads. Genome Biology *20*:287.

Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates.

Continue reading...

Quang Minh HoĂ ng, Nghia Hoang, Bryan Low, and Carl Kingsford (2019) Collective Model Fusion for Multiple Black-Box Experts. ICML PMLR 97:2742-2750, 2019.

Model fusion is a fundamental problem in collective machine learning (ML) where independent experts with heterogeneous learning architectures are required to combine expertise to improve predictive performance. This is particularly challenging in information-sensitive domains (e.g., medical records in health-care analytics) where experts do not have access to each other's internal architecture and local data.

Continue reading...

Guillaume Marçais, Brad Solomon, Rob Patro, and Carl Kingsford (2019) Sketching and Sublinear Data Structures in Genomics. Annual Review of Biomedical Data Science *2*:93-118.

Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.

View source

Vaibhav Rajan, Carl Kingsford, and Xiuwei Zhang (2019) Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming. bioRxiv.

The evolutionary history of biological networks enables deep functional and evolutionary understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPI) based on duplication and divergence. Current methods for model-based ancestral network reconstruction, primarily use greedy heuristics and yield sub-optimal solutions.

Continue reading...

Natalie Sauerwald, Akshat Singhal, and Carl Kingsford (2020) Analysis of the structural variability of topologically associated domains as revealed by Hi-C. Nuc. Acids Res. Genomics and Bioinformatics 2(1):lqz008 (2020).

Three-dimensional chromosome structure plays an integral role in gene expression and regulation, replication timing, and other cellular processes. Topologically associated domains (TADs), building blocks of chromosome structure, are genomic regions with higher contact frequencies within the region than outside the region. A central question is the degree to which TADs are conserved or vary between conditions.

Continue reading...

Dan DeBlasio, Kwanho Kim, and Carl Kingsford (2019) More accurate transcript assembly via parameter advising. The 2019 ICML Workshop on Computational Biology; journal version in Journal of Computational Biology.

Computational tools used for genomic analyses are becoming more accurate but also increasingly sophisticated and complex. This introduces a new problem in that these pieces of software have a large number of tunable parameters which often have a large influence on the results that are reported.

Continue reading...

Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford (2019) Locality sensitive hashing for the edit distance. Bioinformatics 35(14):i127-i135 (ISMB) 2019.

Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment.

Continue reading...

Natalie Sauerwald, Yihang Shen, and Carl Kingsford (2019) Topological data analysis reveals principles of chromosome structure throughout cellular differentiation. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 23:1--23:16 (2019).

Topological data analysis (TDA) is a mathematically well-founded set of methods to derive robust information about the structure and topology of data. It has been applied successfully in several biological contexts.

Continue reading...

Heewook Lee and Carl Kingsford (2018) Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery. Genome Biology *19*:16.

View source

Heewook Lee and Carl Kingsford (2018) Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami. HLA Typing, pages 235-247.

View source

Cong Ma, Mingfu Shao, and Carl Kingsford (2018) SQUID: transcriptomic structural variation detection from RNA-seq. Genome Biology *19*:52.

View source

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2018) Asymptotically optimal minimizers schemes. Bioinformatics (ISMB) *34*(13):i13-i22.

View source

Natalie Sauerwald and Carl Kingsford (2018) Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics (ISMB) *34*(13):i475-i483.

View source

Brad Solomon and Carl Kingsford (2018) Improved search of large transcriptomic sequencing databases using split sequence bloom trees. Journal of Computational Biology *25*(7):755-765.

View source

Hao Wang, Carl Kingsford, and C Joel McManus (2018) Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data. Methods *137*:67-70.

View source

Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford (2017) Improving the performance of minimizers and winnowing schemes. Bioinformatics (ISMB) *33*(14):i110-117.

View source

Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford (2017) Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Computational Biology *13*(10):e1005777.

View source

Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods *14*:417-419.

View source

David Pellow, Darya Filippova, and Carl Kingsford (2017) Improving Bloom filter performance on sequence data using k-mer Bloom filters. Journal of Computational Biology *24*(6):547-557.

View source

Natalie Sauerwald, She Zhang, Carl Kingsford, and Ivet Bahar (2017) Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings. Nucleic Acids Research *45*(7):3663-3673.

View source

Mingfu Shao and Carl Kingsford (2017) Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics *16*(2):658-670.

View source

Mingfu Shao and Carl Kingsford (2017) Accurate assembly of transcripts through phase-preserving graph decomposition. Nature Biotechnology *35*:1167-1169.

View source

Brad Solomon and Carl Kingsford (2017) Improved search of large transcriptomic sequencing databases using split sequence bloom trees. International Conference on Research in Computational Molecular Biology, pages 257-271.

View source

Hao Wang, Joel McManus, and Carl Kingsford (2017) Accurate recovery of ribosome positions reveals slow translation of wobble-pairing codons in yeast. Journal of Computational Biology *24*(6):486-500.

View source

Nitish Gupta, Komal Sanjeev, Tim Wall, Carl Kingsford, and Rob Patro (2016) Efficient index maintenance under dynamic genome modification. ArXiv Preprint ArXiv:1604.03132.

View source

Farhad Hormozdiari, Fereydoun Hormozdiari, Carl Kingsford, Paul Medvedev, and Fabio Vandin (2016) The second decade of the international conference on research in computational molecular biology (RECOMB). International Conference on Research in Computational Molecular Biology, pages 3-16.

View source

Hiren Karathia, Carl Kingsford, Michelle Girvan, and Sridhar Hannenhalli (2016) A pathway-centric view of spatial proximity in the 3D nucleome across cell lines. Scientific Reports *6*:39279.

View source

Carl Kingsford (2016) Teaching Computation to Biologists (book review). Computing in Science & Engineering *18*(1):4-5.

View source

Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford (2016) Compact universal k-mer hitting sets. International Workshop on Algorithms in Bioinformatics, pages 257-268.

View source

Rob Patro, Raquel Norel, Robert J Prill, Julio Saez-Rodriguez, Peter Lorenz, Felix Steinbeck, Bjoern Ziems, Mitja Luštrek, Nicola Barbarini, Alessandra Tiengo, and others (2016) A computational method for designing diverse linear epitopes including citrullinated peptides with desired binding affinities to intravenous immunoglobulin. BMC Bioinformatics *17*:155.

View source

Emre Sefer, Geet Duggal, and Carl Kingsford (2016) Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations. Journal of Computational Biology *23*(6):425-438.

View source

Emre Sefer and Carl Kingsford (2016) Diffusion archeology for diffusion progression history reconstruction. Knowledge and Information Systems, pages 1-25.

View source

Brad Solomon and Carl Kingsford (2016) Fast search of thousands of short-read sequencing experiments. Nature Biotechnology *34*:300-302.

View source

Pieter Spealman, Hao Wang, Gemma May, Carl Kingsford, and C Joel McManus (2016) Exploring ribosome positioning on translating transcripts with ribosome profiling. Post-Transcriptional Gene Regulation, pages 71-97.

View source

Hao Wang, Joel McManus, and Carl Kingsford (2016) Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap. Bioinformatics *32*(12):1880-1882 .

View source

Darya Filippova and Carl Kingsford (2015) Rapid, separable compression enables fast analyses of sequence alignments. Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, pages 194-201.

View source

Carl Kingsford and Rob Patro (2015) Reference-based compression of short-read sequences using path encoding. Bioinformatics *31*(12):1920-1928 .

View source

Rob Patro and Carl Kingsford (2015) Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics *31*(17):2770-2777 .

View source

Emre Sefer and Carl Kingsford (2015) Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales. 2015 IEEE 31st International Conference on Data Engineering, pages 663-674.

View source

Emre Sefer and Carl Kingsford (2015) Semi-nonparametric modeling of topological domain formation from epigenetic data. International Workshop on Algorithms in Bioinformatics, pages 148-161.

View source

Hongyi Xin, Sunny Nahar, Richard Zhu, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu (2015) Optimal seed solver: optimizing seed selection in read mapping. Bioinformatics *32*(11):1632-1642.

View source

Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu (2015) Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics *31*(10):1553-1560.

View source

Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2014) Identification of alternative topological domains in chromatin. Algorithms for Molecular Biology, 9*14*.

View source

Rob Patro, Stephen M Mount, and Carl Kingsford (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology *32*:462-464.

View source

Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, and Carl Kingsford (2013) Resolving spatial inconsistencies in chromosome conformation measurements. Algorithms for Molecular Biology *8*:8.

View source

Geet Duggal, Hao Wang, and Carl Kingsford (2013) Higher-order chromatin domains link eQTLs with the expression of far-away genes. Nucleic Acids Research *42*(1):87-96 .

View source

Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2013) Multiscale identification of topological domains in chromatin. International Workshop on Algorithms in Bioinformatics, pages 300-312.

View source

Rob Patro and Carl Kingsford (2013) Predicting protein interactions via parsimonious network history inference. Bioinformatics *29*(13):i237-i246.

View source

Hao Wang, Geet Duggal, Rob Patro, Michelle Girvan, Sridhar Hannenhalli, and Carl Kingsford (2013) Topological properties of chromosome conformation graphs reflect spatial proximities within chromatin. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, pages 306-315.

View source

Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, and Carl Kingsford (2012) Resolving spatial inconsistencies in chromosome conformation data. International Workshop on Algorithms in Bioinformatics, pages 288-300.

View source

Geet Duggal and Carl Kingsford (2012) Graph rigidity reveals well-constrained regions of chromosome conformation embeddings. BMC Bioinformatics *13*:241.

View source

Darya Filippova, Aashish Gadani, and Carl Kingsford (2012) Coral: an integrated suite of visualizations for comparing clusterings. BMC Bioinformatics *13*:276.

View source

Darya Filippova, Michael Fitzgerald, Carl Kingsford, and Fernando Benadon (2012) Dynamic exploration of recording sessions between jazz musicians over time. ASE/IEEE International Conference on Social Computing, pages 368-376.

View source

Rob Patro and Carl Kingsford (2012) Global network alignment using multiscale spectral signatures. Bioinformatics *28*(23):3105-3114.

View source

Robert Patro, Geet Duggal, Emre Sefer, Hao Wang, Darya Filippova, and Carl Kingsford (2012) The missing models: A data-driven approach for learning how networks grow. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42-50.

View source

Rob Patro, Emre Sefer, Justin Malin, Guillaume Marçais, Saket Navlakha, and Carl Kingsford (2012) Parsimonious reconstruction of network evolution. Algorithms for Molecular Biology *7*:25.

View source

Cong Ma, Hongyu Zheng, and Carl Kingsford (2019) Finding ranges of optimal transcript expression quantification in cases of non-identifiability. RECOMB 2021.

Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Probabilistic models are at the core of current quantification methods. The scenario where a probabilistic model has multiple optimal solutions is known as non-identifiability.

Continue reading...

David R Kelley and Carl Kingsford (2011) Extracting between-pathway models from E-MAP interactions using expected graph compression. Journal of Computational Biology *18*(3):379-390.

View source

Carl Kingsford, Elena Zaslavsky, and Mona Singh (2011) A cost-aggregating integer linear program for motif finding. Journal of Discrete Algorithms *9*(4):326-334.

View source

Guillaume Marçais and Carl Kingsford (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics *27*(6):764-770.

View source

Saket Navlakha and Carl Kingsford (2011) Network archaeology: uncovering ancient networks from present-day interactions. PLoS Computational Biology *7*(4):e1001119.

View source

Emre Sefer and Carl Kingsford (2011) Metric labeling and semi-metric embedding for protein annotation prediction. International Conference on Research in Computational Molecular Biology, pages 392-407.

View source

Joshua Wetzel, Carl Kingsford, and Mihai Pop (2011) Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics *12*:95.

View source

Geet Duggal, Saket Navlakha, Michelle Girvan, and Carl Kingsford (2010) Uncovering many views of biological networks using ensembles of near-optimal partitions. MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings at KDD.

View source

David R Kelley and Carl Kingsford (2010) Extracting between-pathway models from E-MAP interactions using expected graph compression. Annual International Conference on Research in Computational Molecular Biology, pages 248-262.

View source

Carl Kingsford, Michael C Schatz, and Mihai Pop (2010) Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics *11*:21.

View source

Grecia Lapizco-Encinas, Carl Kingsford, and James Reggia (2010) Particle swarm optimization for multimodal combinatorial problems and its application to protein design. IEEE Congress on Evolutionary Computation, pages 1-8.

View source

Niranjan Nagarajan and Carl Kingsford (2010) GiRaF: robust, computational identification of influenza reassortments via graph mining. Nucleic Acids Research *39*(6):e34 .

View source

Saket Navlakha and Carl Kingsford (2010) The power of protein interaction networks for associating genes with diseases. Bioinformatics *26*(8):1057-1063.

View source

Saket Navlakha and Carl Kingsford (2010) Exploring biological network dynamics with ensembles of graph partitions. Pacific Symposium on Biocomputing (PSB), pages 166-177.

View source

Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, and Carl Kingsford (2010) Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. Journal of Computational Biology *17*(3):503--516.

View source

James R White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop (2010) Alignment and clustering of phylogenetic markers-implications for microbial diversity studies. BMC Bioinformatics *11*:152.

View source

Carl Kingsford and Guillaume Marçais (2009) Vertices of degree k in edge-minimal, k-edge-connected graphs. ArXiv Preprint ArXiv:0905.1064.

View source

Carl Kingsford and Guillaume Marçais (2009) A synthesis for exactly 3-edge-connected graphs. ArXiv Preprint ArXiv:0905.1053.

View source

Carl Kingsford, Niranjan Nagarajan, and Steven L Salzberg (2009) 2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates. Plos One *4*(7):e6402 .

View source

Grecia Lapizco-Encinas, Carl Kingsford, and James Reggia (2009) A cooperative combinatorial particle swarm optimization algorithm for side-chain packing. 2009 IEEE Swarm Intelligence Symposium, pages 22-29.

View source

Saket Navlakha, Michael C Schatz, and Carl Kingsford (2009) Revealing biological modules via graph summarization. Journal of Computational Biology *16*(2):253-264.

View source

Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, and Carl Kingsford (2009) Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. Annual International Conference on Research in Computational Molecular Biology, pages 400-417.

View source

Carl Kingsford and Steven L Salzberg (2008) What are decision trees?. Nature Biotechnology *26*(9):1011-1013 .

View source

Niranjan Nagarajan and Carl Kingsford (2008) Uncovering genomic reassortments among influenza strains by enumerating maximal bicliques. 2008 IEEE International Conference on Bioinformatics and Biomedicine, pages 223-230.

View source

Carl Kingsford, Arthur L Delcher, and Steven L Salzberg (2007) A unified model explaining the offsets of overlapping and near-overlapping prokaryotic genes. Molecular Biology and Evolution *24*:2091-2098 .

View source

Carleton L Kingsford, Kunmi Ayanbule, and Steven L Salzberg (2007) Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biology *8*:R22.

View source

Steven L Salzberg, Carl Kingsford, Giovanni Cattoli, David J Spiro, Daniel A Janies, Mona Mehrez Aly, Ian H Brown, Emmanuel Couacy-Hymann, Gian Mario De Mia, Do Huu Dung, and others (2007) Genome analysis linking recent European and African influenza (H5N1) viruses. Emerging Infectious Diseases *13*(5):713-718 .

View source

Carl Kingsford, Elena Zaslavsky, and Mona Singh (2006) A compact mathematical programming formulation for DNA motif finding. Annual Symposium on Combinatorial Pattern Matching, pages 233-245.

View source

Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) The inapproximability of side-chain positioning. Technical Report. Princeton University, Princeton.

View source

Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) A semidefinite programming approach to side chain positioning with new rounding strategies. INFORMS Journal on Computing *16*:380-392.

View source

Carleton L Kingsford, Bernard Chazelle, and Mona Singh (2004) Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics *21*:1028-1039.

View source

Bernard Chazelle, Carl Kingsford, and Mona Singh (2003) The side-chain positioning problem: a semidefinite programming formulation with new rounding schemes. Proceedings of the Paris C. Kanellakis memorial workshop on Principles of computing & knowledge: Paris C. Kanellakis memorial workshop on the occasion of his 50th birthday, pages 86-94.

View source

John H Reif, Thomas H LaBean, Michael Pirrung, Vipul S Rana, Bo Guo, Carl Kingsford, and Gene S Wickham (2001) Experimental construction of very large scale DNA databases with associative search capability. International Workshop on DNA-Based Computers, pages 231-247.

View source

View source