Kingsford Group

carlk@cs.cmu.edu

Herbert A. Simon Professor of Computer Science

Ray and Stephanie Lane Computational Biology Department

School of Computer Science

Carnegie Mellon University

Affiliate Faculty, Machine Learning Department

Director, Center for Machine Learning in Health

Director, Center for Innovation in Health

Co-Director, Joint CMU-Univ Pittsburgh Ph.D. Program in Computational Biology

CEO, Ocean Genomics

Twitter/X

Google Scholar

Room 7719, Gates-Hillman Complex
Carnegie Mellon University
5000 Forbes Ave
Pittsburgh, PA 15213
Continue reading...

August 26, 2024 — Haotian Teng has successfully defended his Ph.D. dissertation. Teng was co-advised by Carl Kingsford and Ziv Bar-Joseph. Congratulations Dr. Teng!

Haotian Teng (2024) Improving performance on unsupervised biological tasks with hybrid models. Ph.D. Thesis Tech Report.

Continue reading...

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1--58 .

The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms" has been accepted to the Journal of the ACM.

Continue reading...

Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology?. Cell Systems 15(6): P483-487.

The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm.

Continue reading...

Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol..

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly.

Continue reading...

April 24, 2024 — Six out of 57 accepted RECOMB24 papers are authored by current or former Kingsford group members. These include 2 papers authored by current Ph.D. students, and 4 authored by former trainees of the group.

Continue reading...

April 10, 2024 — Yihang Shen has successfully defended his Ph.D. Congratulations Dr. Shen!

Yihang Shen (2024) Automated hyper-parameter tuning and its applications in computational biology. Ph.D. Thesis (CMU-CB-24-100).

Continue reading...

March 8, 2024 — Carl Kingsford, Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department, has been elected as a Fellow of the International Society of Computational Biology (ISCB).

The ISCB notes that Carl has been chosen for this honor because he “is a trailblazer in computational molecular biology, showcasing sustained innovation in scalable algorithmic approaches.”

Continue reading...

Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. Bioinformatics.

Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms.

Continue reading...

Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv.

Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because of the heterogeneity of different RNA-seq samples, a single default setting or a small fixed set of parameter candidates can only support the good performance of transcript assembly on average, but are often suboptimal for many individual samples.

Continue reading...

Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024.

We tackle the problem of meta-learning across heterogenous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity.

Continue reading...

C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford. (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024.

The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors.

Continue reading...

Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. Genome Research, to appear..

Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps.

Continue reading...

Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024.

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes.

Continue reading...

2023

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology.

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment.

Continue reading...

October 16, 2023 — Minh Hoang has successfully defended his Ph.D. dissertation, which is titled “Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology.” He will join Princeton University as a postdoc. Congratulations Dr. Hoang!

Minh Hoang (2023) Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology. Ph.D. Thesis.

Continue reading...

Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford (2023) Reinforcement Learning for Robotic Liquid Handler Planning. In Proceedings of WABI 2023.

Robotic liquid handlers play a crucial role in automating laboratory tasks such as sample preparation, high-throughput screening, and assay development. Manually designing protocols takes significant effort, and can result in inefficient protocols and involve human error. We investigate the application of reinforcement learning to automate the protocol design process resulting in reduced human labor and further automation in liquid handling.

Continue reading...

Minh Hoang, Guillaume Marçais, and Carl Kingsford (2024) Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. Journal of Computational Biology 31(1):2-20.

Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to be simultaneously achieved.

Continue reading...

Yihang Shen and Carl Kingsford (2023) Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection. In Proceedings of AutoML 2023.

Bayesian Optimization (BO) is a method for globally optimizing black-box func- tions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming.

Continue reading...

Hongyu Zheng, Guillaume Marçais, and Carl Kingsford (2023) Creating and Using Minimizer Sketches in Computational Genomics. J Comp. Biol. (2023).

Processing large datasets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more.

Continue reading...

June 10, 2023 — In order to survey the current frontier of the interface between AI methodology and biology and to chart future directions and challenges, we held an "NSF-NIH Joint Workshop on Emerging AI in Biology" in June 2023 that gathered approximately 40 experts on the intersection of research in AI and biology.

Continue reading...

Yutong Qiu, Yihang Shen, and Carl Kingsford (2023) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. In Proceedings of WABI 2023.

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error- prone process of genome assembly.

Continue reading...

May 5, 2023 — Yutong Qiu, a CPCB Ph.D. student, successfully defended her dissertation titled “Algorithmic Foundations of Genome Graph Construction and Comparison.” She will join Illumina, Inc. Congratulations, Dr. Qiu!

Yutong Qiu (2023) Algorithmic Foundations of Genome Graph Construction and Comparison. Ph.D. Thesis CMU-CB-23-101.

Pangenomic studies have enabled a more accurate depiction of the human genome landscape. Genome graphs are suitable data structures for analyzing collections of genomes due to their efficiency and flexibility of encoding shared and unique substrings from the population of encoded genomes.

Continue reading...

2022

October 28, 2022 — Laura Tung successfully defended her Ph.D. thesis. She will join Gaurdant Health, Inc. Congratulations, Dr. Tung!

Laura H. Tung (2022) Algorithms and computational methods for transcriptome analysis . Ph.D. (CMU-CB-22-108).

Continue reading...

Yutong Qiu and Carl Kingsford (2022) The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. In Proceedings of ISMB 2022 in Bioinformatics 38(Supplement_1):i404-412 (2022)..

Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets.

Continue reading...

February 24, 2022 — Hongyu Zheng successfully defended his Ph.D. thesis. He will joint Princeton University as a postdoc. Congradulations, Dr. Zheng!

Hongyu Zheng (2022) Theory and Practice of Low-Density Minimizer Sketches. Ph.D. Dissertation.

Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches.

Continue reading...

Minh Hoang, Hongyu Zheng, and Carl Kingsford (2022) DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes. In Proceedings of RECOMB 2022..

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite its widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question.

Continue reading...

2021

Minh Hoang and Carl Kingsford (2021) Personalized neural architecture search for federated learning.. In NeurIPS 2021 Workshop on New Frontiers in Federated Learning 2021..

Federated Learning (FL) is a recently proposed learning paradigm for decentralized devices to collaboratively train a predictive model without exchanging private data. Existing FL frameworks, however, assume a one-size-fit-all model architecture to be collectively trained by local devices, which is determined prior to observing their data.

Continue reading...

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2019) How much data is sufficient to learn high-performing algorithms?. STOC 2021 (preprint arXiv:1908.02894 [cs.LG] (2019)).

Algorithms for scientific analysis typically have tunable parameters that significantly influence computational efficiency and solution quality. If a parameter setting leads to strong algorithmic performance on average over a set of typical problem instances, that parameter setting---ideally---will perform well in the future.

Continue reading...

Yutong Qiu and Carl Kingsford (2021) Constructing small genome graphs via string compression. ISMB 2021.

The size of a genome graph -- the space required to store the nodes, their labels and edges -- affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs.

Continue reading...

Laura H. Tung and Carl Kingsford (2021) Practical selection of representative sets of RNA-seq samples using a hierarchical approach. ISMB 2021.

Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools.

Continue reading...

February 5, 2021 — CPCB Ph.D. student Yutong Qiu has been awarded an SCS Cancer Research Fellowship. Yutong works on computational methods for understanding the human genome, including methods to identify variants within single genomes and populations of genomes. Her recent work is focused on construction and use of genome graphs for applications in cancer, especially more accurate subtyping of cancers from genomic features.

Continue reading...

February 4, 2021 — Carl will receive the Herbert A. Simon Professorship of Computer Science in a virtual ceremony at 5:30 p.m. on Thursday, Feb. 4, 2021.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2021) Sequence-specific minimizers via polar sets. ISMB 2021.

Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size.

Continue reading...

Avi Srivastava, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Carl Kingsford, and Rob Patro (2021) Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism. bioRxiv, doi: https://doi.org/10.1101/2021.01.17.426996.

Transcript and gene quantification is the first step in many RNA-seq analyses. While many factors and properties of experimental RNA-seq data likely contribute to differences in accuracy between various approaches to quantification, it has been demonstrated (1) that quantification accuracy generally benefits from considering, during alignment, potential genomic origins for sequenced fragments that reside outside of the annotated transcriptome.

Continue reading...

Natalie Sauerwald and Carl Kingsford (2021) Capturing the complexity of topologically associating domains through multi-feature optimization. bioRxiv.

The three-dimensional structure of human chromosomes is tied to gene regulation and replication timing, but there is still a lack of consensus on the computational and biological definitions for chromosomal substructures such as topologically associating domains (TADs). TADs are described and identified by various computational properties leading to different TAD sets with varying compatibility with biological properties such as boundary occupancy of structural proteins.

Continue reading...

Cong Ma, Hongyu Zheng, and Carl Kingsford (2019) Finding ranges of optimal transcript expression quantification in cases of non-identifiability. RECOMB 2021.

Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Probabilistic models are at the core of current quantification methods. The scenario where a probabilistic model has multiple optimal solutions is known as non-identifiability.

Continue reading...

2020

October 12, 2020 — Cong Ma has successfully defended her Ph.D. Congratulations Dr. Ma! Cong will join Princeton University as a postdoc.

Cong Ma (2020) Detecting anomalies in inferred transcript sequences and expression from RNA-seq. Ph.D. Thesis.

Anomalies are data points that do not follow established or expected patterns. When measuring gene expression, anomalies in RNA-seq are observations or pat- terns that cannot be explained by the inferred transcript sequences or expressions. Transcript sequences and expression are key indicators for cell status and are used in many phenotypic and disease analyses.

Continue reading...

Yihang Shen and Carl Kingsford (2020) Probabilistic method corrects previously uncharacterized Hi-C artifact. bioRxiv.

Three-dimensional chromosomal structure plays an important role in gene regulation. Chromosome conformation capture techniques, especially the high-throughput, sequencing-based technique Hi-C, provide new insights on spatial architectures of chromosomes. However, Hi-C data contains artifacts and systemic biases that substantially influence subsequent analysis.

Continue reading...

Natalie Sauerwald (2020) Algorithms for the study of chromosomal structure variability . Ph.D. Dissertation.

October 5, 2020 — The last two decades have introduced several experimental methods for studying three-dimensional chromosome structure, opening up a new dimension of genomics. Studies of these new data types have shown great promise in explaining some of the open questions in gene regulation, but the experiments are indirect and imperfect measurements of the underlying structure, requiring rigorous computational methods.

Continue reading...

Cong Ma, Hongyu Zheng, and Carl Kingsford (2020) Exact Transcript Quantification Over Splice Graphs. In Proceedings of WABI 2020; journal version in Alg. Mol. Bio. 2021.

The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript.

Continue reading...

Minh Hoang and Carl Kingsford (2020) Optimizing Dynamic Structures with Bayesian Generative Search. ICML 2020.

Kernel selection for kernel-based model is a difficult problem. Exact solutions are, in general, prohibitively expensive to solve for due to the NP-hard nature of these combinatorial tasks. In addition, gradient-based optimization techniques are inapplicable due to the non-differentiability of the objective function. As such, many state-of-the-art solutions for this problem resorts to heuristic search and gradient-free optimization.

Continue reading...

Trevor Frisby, Shawn James Baker, Guillaume Marçais, Quang Minh Hoang, Carl Kingsford, and Christopher James Langmead (2020) Harvestman: A framework for hierarchical feature learning and selection from whole genome sequencing data.. BMC Bioinformatics 22:174 (2021).

We present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, the largest publicly available collection of whole genome sequences.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Improved design and analysis of practical minimizers. Proceedings of ISMB 2020 (Bioinformatics).

Minimizers are methods to sample k-mers from a sequence, with the guarantee that similar set of k-mers will be chosen on similar sequences. It is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of softwares and pipelines to improve computation efficiency and decrease memory usage.

Continue reading...

Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Lower density selection schemes via small universal hitting sets with short remaining path length. Proceedings of RECOMB 2020 (preprint: arXiv:2001.06550 [cs.DS]).

Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size.

Continue reading...

Natalie Sauerwald, Akshat Singhal, and Carl Kingsford (2020) Analysis of the structural variability of topologically associated domains as revealed by Hi-C. Nuc. Acids Res. Genomics and Bioinformatics 2(1):lqz008 (2020).

Three-dimensional chromosome structure plays an integral role in gene expression and regulation, replication timing, and other cellular processes. Topologically associated domains (TADs), building blocks of chromosome structure, are genomic regions with higher contact frequencies within the region than outside the region. A central question is the degree to which TADs are conserved or vary between conditions.

Continue reading...

2019

Prashant Pandey, Yinjie Gao, and Carl Kingsford (2019) VariantStore: A Large-Scale Genomic Variant Search Index. Genome Biology *22*: 231 (2021).

The ability to efficiently query genomic variation data from thousands of samples is critical to achieve the full potential of many medical and scientific applications such as personalized medicine. We present VariantStore, a system for efficiently indexing and searching millions of genomic variants across thousands of samples.

Continue reading...

Cong Ma and Carl Kingsford (2019) Estimating mutual information under measurement error. bioRxiv 852384.

Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases.

Continue reading...

November 10, 2019 — Congratulations to Natalie Sauerwald, 5th-year CPCB Ph.D. student, for having one of her papers:

N Sauerwald and C Kingsford. Quantifying the similarity of topological domains across normal and cancer human cell types, Bioinformatics (ISMB) 34(13):i475-i483 (2018).

selected as one of the “Top 10 Reading Papers“ at RECOMB/ISCB Regulatory & Systems Genomics 2019.

The paper developed a method to compare genome structure

View source

Natalie Sauerwald and Carl Kingsford (2019) A statistical method for identifying consistently important features across samples. BioRxiv.

In many applications, features with consistently high measurements across many samples are particularly meaningful and useful for quality control or biological interpretation. Identification of these features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values.

Continue reading...

Cong Ma and Carl Kingsford (2019) Detecting, categorizing, and correcting coverage anomalies of RNA-seq quantification. Cell Systems *9*:1-11.

Due to incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA-seq sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation.

Continue reading...

HuBMAP Consortium (2019) The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature *574*:187-192 (2019).

Transformative technologies are enabling the construction of three-dimensional maps of tissues with unprecedented spatial and molecular resolution.

Continue reading...

September 10, 2019 — The paper ‘Practical Universal k-mer Sets for Minimizer Schemes’, was awarded best student paper at the ACM-BCB 2019 conference in Niagara Falls, NY.

Continue reading...

Scott A. Keith, Rory Eutsey, Heewook Lee, Brad Solomon, Stacie Oliver, Carl Kingsford, N. Luisa Hiller, and Brooke M. McCartney (2019) Identification of Microbiota-Induced Gene Expression Changes in the Drosophila melanogaster Head. bioRxiv.

Symbiotic microorganisms exert multifaceted impacts on the physiology of their animal hosts. Recent discoveries have shown the gut microbiota influence host brain function and behavior, but the host and microbial molecular factors required to actuate these effects are largely unknown.

Continue reading...

Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford (2019) Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 18:1--18:19 (2019); journal version in Alg Mol Bio (2020).

Transcriptomic structural variants (TSVs) --- structural variants that affect expressed regions --- are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs.

Continue reading...

Avi Srivastava, Laraib Malik, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro (2019) Alignment and mapping methodology influence transcript abundance estimation. Genome Biology 21: 239..

The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. We investigate the effect of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential gene expression analysis.

Continue reading...

Dan DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marcais (2019) Practical universal k-mer sets for minimizer schemes. To appear in ACM-BCB 2019.

Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same.

Continue reading...

Hongyi Xin, Mingfu Shao, and Carl Kingsford (2019) Context-Aware Seeds for Read Mapping. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 15:1--15:13 (2019) (journal version in Alg Mol Bio).

Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers.

Continue reading...

Laura H. Tung, Mingfu Shao, and Carl Kingsford (2019) Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads. Genome Biology *20*:287.

Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates.

Continue reading...

Quang Minh Hoàng, Nghia Hoang, Bryan Low, and Carl Kingsford (2019) Collective Model Fusion for Multiple Black-Box Experts. ICML PMLR 97:2742-2750, 2019.

Model fusion is a fundamental problem in collective machine learning (ML) where independent experts with heterogeneous learning architectures are required to combine expertise to improve predictive performance. This is particularly challenging in information-sensitive domains (e.g., medical records in health-care analytics) where experts do not have access to each other's internal architecture and local data.

Continue reading...

Guillaume Marçais, Brad Solomon, Rob Patro, and Carl Kingsford (2019) Sketching and Sublinear Data Structures in Genomics. Annual Review of Biomedical Data Science *2*:93-118.

Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.

View source

April 11, 2019 — Laura Tung, a Ph.D. student in the CPCB Computational Biology Program at Carnegie Mellon University has been awarded a NSF Graduate Research Fellowship. This highly competitive fellowship will partially fund her research for 3 years.

Congratulations!

View source

Vaibhav Rajan, Carl Kingsford, and Xiuwei Zhang (2019) Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming. bioRxiv.

The evolutionary history of biological networks enables deep functional and evolutionary understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPI) based on duplication and divergence. Current methods for model-based ancestral network reconstruction, primarily use greedy heuristics and yield sub-optimal solutions.

Continue reading...

Dan DeBlasio, Kwanho Kim, and Carl Kingsford (2019) More accurate transcript assembly via parameter advising. The 2019 ICML Workshop on Computational Biology; journal version in Journal of Computational Biology.

Computational tools used for genomic analyses are becoming more accurate but also increasingly sophisticated and complex. This introduces a new problem in that these pieces of software have a large number of tunable parameters which often have a large influence on the results that are reported.

Continue reading...

Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford (2019) Locality sensitive hashing for the edit distance. Bioinformatics 35(14):i127-i135 (ISMB) 2019.

Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment.

Continue reading...

Natalie Sauerwald, Yihang Shen, and Carl Kingsford (2019) Topological data analysis reveals principles of chromosome structure throughout cellular differentiation. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 23:1--23:16 (2019).

Topological data analysis (TDA) is a mathematically well-founded set of methods to derive robust information about the structure and topology of data. It has been applied successfully in several biological contexts.

Continue reading...

2018

Heewook Lee and Carl Kingsford (2018) Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery. Genome Biology *19*:16.

View source

Heewook Lee and Carl Kingsford (2018) Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami. HLA Typing, pages 235-247.

View source

Cong Ma, Mingfu Shao, and Carl Kingsford (2018) SQUID: transcriptomic structural variation detection from RNA-seq. Genome Biology *19*:52.

View source

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2018) Asymptotically optimal minimizers schemes. Bioinformatics (ISMB) *34*(13):i13-i22.

View source

Natalie Sauerwald and Carl Kingsford (2018) Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics (ISMB) *34*(13):i475-i483.

View source

Brad Solomon and Carl Kingsford (2018) Improved search of large transcriptomic sequencing databases using split sequence bloom trees. Journal of Computational Biology *25*(7):755-765.

View source

Hao Wang, Carl Kingsford, and C Joel McManus (2018) Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data. Methods *137*:67-70.

View source

2017

November 10, 2017 — Dan DeBlasio has published a book with his Ph.D. advisor covering his research on learning the best way to run algorithms for multiple sequence alignment.

Dan is a member of the Kingsford Group and a Lane Fellow in the Computational Biology Department at Carnegie Mellon.

View source

August 16, 2017 — Response to Blog Post About Salmon.

View source

August 10, 2017 — A Ph.D. might span ~240 weekends. If we assume that 1 weekend / month is spent relaxing, watching movies and TV and running errands, 1 weekend / month is spent working 😁, and 4 weekends per year are spent traveling, that leaves about 100 weekends to explore Pittsburgh, if you are a student here.

Here’s a list of 100 things to do during those weekends.

View source

April 16, 2017 — Lane Fellow Mingfu Shao was interviewed on the inaugural episode of the The Bioinformatics Chat podcast! He talks about his new transcriptome assembler, Scallop.

Listen to hear all about Mingfu’s work!

You can read more about Scallop here.

View source

Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford (2017) Improving the performance of minimizers and winnowing schemes. Bioinformatics (ISMB) *33*(14):i110-117.

View source

Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods *14*:417-419.

View source

David Pellow, Darya Filippova, and Carl Kingsford (2017) Improving Bloom filter performance on sequence data using k-mer Bloom filters. Journal of Computational Biology *24*(6):547-557.

View source

Mingfu Shao and Carl Kingsford (2017) Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics *16*(2):658-670.

View source

Mingfu Shao and Carl Kingsford (2017) Accurate assembly of transcripts through phase-preserving graph decomposition. Nature Biotechnology *35*:1167-1169.

View source

2016

Nitish Gupta, Komal Sanjeev, Tim Wall, Carl Kingsford, and Rob Patro (2016) Efficient index maintenance under dynamic genome modification. ArXiv Preprint ArXiv:1604.03132.

View source

Hiren Karathia, Carl Kingsford, Michelle Girvan, and Sridhar Hannenhalli (2016) A pathway-centric view of spatial proximity in the 3D nucleome across cell lines. Scientific Reports *6*:39279.

View source

Carl Kingsford (2016) Teaching Computation to Biologists (book review). Computing in Science & Engineering *18*(1):4-5.

View source

Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford (2016) Compact universal k-mer hitting sets. International Workshop on Algorithms in Bioinformatics, pages 257-268.

View source

Emre Sefer and Carl Kingsford (2016) Diffusion archeology for diffusion progression history reconstruction. Knowledge and Information Systems, pages 1-25.

View source

Brad Solomon and Carl Kingsford (2016) Fast search of thousands of short-read sequencing experiments. Nature Biotechnology *34*:300-302.

View source

Hao Wang, Joel McManus, and Carl Kingsford (2016) Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap. Bioinformatics *32*(12):1880-1882 .

View source

2015

Carl Kingsford and Rob Patro (2015) Reference-based compression of short-read sequences using path encoding. Bioinformatics *31*(12):1920-1928 .

View source

Rob Patro and Carl Kingsford (2015) Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics *31*(17):2770-2777 .

View source

Emre Sefer and Carl Kingsford (2015) Semi-nonparametric modeling of topological domain formation from epigenetic data. International Workshop on Algorithms in Bioinformatics, pages 148-161.

View source

Hongyi Xin, Sunny Nahar, Richard Zhu, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu (2015) Optimal seed solver: optimizing seed selection in read mapping. Bioinformatics *32*(11):1632-1642.

View source

2014

Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2014) Identification of alternative topological domains in chromatin. Algorithms for Molecular Biology, 9*14*.

View source

2013

Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, and Carl Kingsford (2013) Resolving spatial inconsistencies in chromosome conformation measurements. Algorithms for Molecular Biology *8*:8.

View source

Geet Duggal, Hao Wang, and Carl Kingsford (2013) Higher-order chromatin domains link eQTLs with the expression of far-away genes. Nucleic Acids Research *42*(1):87-96 .

View source

Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2013) Multiscale identification of topological domains in chromatin. International Workshop on Algorithms in Bioinformatics, pages 300-312.

View source

Rob Patro and Carl Kingsford (2013) Predicting protein interactions via parsimonious network history inference. Bioinformatics *29*(13):i237-i246.

View source

2012

Geet Duggal and Carl Kingsford (2012) Graph rigidity reveals well-constrained regions of chromosome conformation embeddings. BMC Bioinformatics *13*:241.

View source

Darya Filippova, Aashish Gadani, and Carl Kingsford (2012) Coral: an integrated suite of visualizations for comparing clusterings. BMC Bioinformatics *13*:276.

View source

Rob Patro and Carl Kingsford (2012) Global network alignment using multiscale spectral signatures. Bioinformatics *28*(23):3105-3114.

View source

Rob Patro, Emre Sefer, Justin Malin, Guillaume Marçais, Saket Navlakha, and Carl Kingsford (2012) Parsimonious reconstruction of network evolution. Algorithms for Molecular Biology *7*:25.

View source

2011 and Before

David R Kelley and Carl Kingsford (2011) Extracting between-pathway models from E-MAP interactions using expected graph compression. Journal of Computational Biology *18*(3):379-390.

View source

Carl Kingsford, Elena Zaslavsky, and Mona Singh (2011) A cost-aggregating integer linear program for motif finding. Journal of Discrete Algorithms *9*(4):326-334.

View source

Guillaume Marçais and Carl Kingsford (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics *27*(6):764-770.

View source

Saket Navlakha and Carl Kingsford (2011) Network archaeology: uncovering ancient networks from present-day interactions. PLoS Computational Biology *7*(4):e1001119.

View source

Emre Sefer and Carl Kingsford (2011) Metric labeling and semi-metric embedding for protein annotation prediction. International Conference on Research in Computational Molecular Biology, pages 392-407.

View source

Carl Kingsford, Michael C Schatz, and Mihai Pop (2010) Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics *11*:21.

View source

Niranjan Nagarajan and Carl Kingsford (2010) GiRaF: robust, computational identification of influenza reassortments via graph mining. Nucleic Acids Research *39*(6):e34 .

View source

Saket Navlakha and Carl Kingsford (2010) The power of protein interaction networks for associating genes with diseases. Bioinformatics *26*(8):1057-1063.

View source

Saket Navlakha and Carl Kingsford (2010) Exploring biological network dynamics with ensembles of graph partitions. Pacific Symposium on Biocomputing (PSB), pages 166-177.

View source

Carl Kingsford and Guillaume Marçais (2009) Vertices of degree k in edge-minimal, k-edge-connected graphs. ArXiv Preprint ArXiv:0905.1064.

View source

Carl Kingsford and Guillaume Marçais (2009) A synthesis for exactly 3-edge-connected graphs. ArXiv Preprint ArXiv:0905.1053.

View source

Carl Kingsford, Niranjan Nagarajan, and Steven L Salzberg (2009) 2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates. Plos One *4*(7):e6402 .

View source

Saket Navlakha, Michael C Schatz, and Carl Kingsford (2009) Revealing biological modules via graph summarization. Journal of Computational Biology *16*(2):253-264.

View source

Carl Kingsford and Steven L Salzberg (2008) What are decision trees?. Nature Biotechnology *26*(9):1011-1013 .

View source

Carl Kingsford, Elena Zaslavsky, and Mona Singh (2006) A compact mathematical programming formulation for DNA motif finding. Annual Symposium on Combinatorial Pattern Matching, pages 233-245.

View source

Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) The inapproximability of side-chain positioning. Technical Report. Princeton University, Princeton.

View source

Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) A semidefinite programming approach to side chain positioning with new rounding strategies. INFORMS Journal on Computing *16*:380-392.

View source

Carleton L Kingsford, Bernard Chazelle, and Mona Singh (2004) Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics *21*:1028-1039.

View source

View source