Search Kingsford Group

title text date wordCount minutes
Carl Kingsford [Image Omitted] Carl Kingsford ============== carlk@cs.cmu.edu Herbert A. Simon Professor of Computer Science Ray and Stephanie Lane Computational Biology Department https://cbd.cmu.edu School of Computer Science https://www.cs.cmu.edu Carnegie Mellon University https://www.cmu.edu Affiliate Faculty, Machine Learning Department https://www.ml.cmu.edu Director, Center for Machine Learning in Health https://cmlh.org Director, Center for Innovation in Health https://www.cs.cmu.edu/cih/ Co-Director, Joint CMU-Univ Pittsburgh Ph.D. Program in Computational Biology https://compbio.cmu.edu [Image Omitted] CEO, Ocean Genomics https://oceangenomics.com Twitter/X https://twitter.com/ckingsford Google Scholar https://scholar.google.com/citations?user=V_cvqKcAAAAJ&hl=en Room 7719, Gates-Hillman ComplexCarnegie Mellon University5000 Forbes AvePittsburgh, PA 15213 My research group is interested in advancing computer science and machine learning methodologies to extract insight from biological data. We currently focus on the following classes of problems: - *Genomics & genome assembly*: RNA-seq expression quantification; genome assembly; large-scale sequence search, sketching, etc. This work is currently supported by NIH grant 1R01HG012470. It was previously supported by NIH grant 1R21HG006913, NSF grant CCF-1319998, a Data-Driven Investigator grant from the Gordon and Betty Moore Foundation NIH grant R01GM122935, and an award from The Shurl and Kay Curci foundation. - *Pan-genomics and genome graphs*: Storing, searching, and using large collections of genomes and transcriptomes from many individuals. This work is supported by NSF grant III-2232121. - *Automatically learning algorithms and wetlab protocols*: Scheduling automated experimentation, hyperparameter optimization, autoML, and automated algorithm design. Supported by an award from Schmidt Sciences. Previous research interests include: - Chromatin structure and function: Algorithms for determining the spatial organization of eukaryotic genomes from Chromosome Conformation Capture data. Previously supported by NIH grant R01HG007104. - Viral evolution: Reassortment in the influenza genome. This work was supported by NIH grant 1R21AI085376. - Protein interactions and networks: Evolution of interactions; protein function prediction; clustering within networks; protein structure prediction. This work was supported by NSF grant EF-0849899 and by NSF grant CCF-1053918/CCF-1256087 (CAREER award). Disclosure: I am a co-founder of Ocean Genomics, Inc. https://oceangenomics.com 05/24/9999 356 1.8
Haotian Teng successfully defends his dissertation Haotian Teng successfully defends his dissertation ================================================== August 26, 2024 Haotian Teng has successfully defended his Ph.D. dissertation. Teng was co-advised by Carl Kingsford and Ziv Bar-Joseph. Congratulations Dr. Teng! Haotian Teng (2024) Haotian Teng successfully defends his dissertation. Ph.D. Thesis Tech Report. Many fundamental biological tasks require unsupervised learning where groundtruth labels are unavailable, but shallow unsupervised machine learning methods have poor performance on these tasks due to the complexity of the problem. With their strong representation power, deep learning models have been widely applied to solve challenging tasks; however, they usually require large amounts of labeled data. To take advantage of the strong representation power of deep learning while applying it to unsupervised tasks, we developed several hybrid models that combine deep neural networks and unsupervised machine learning models. We used these models to improve performance on unsupervised biological tasks, including cell type clustering, basecalling, and molecular docking, demonstrating how a hybrid model can be used in solving spatial tasks (cell-type clustering), temporal tasks (basecalling), and spatial-temporal tasks (molecular docking). First, we present an unsupervised cell type clustering model for recently developed single-molecule, spatially resolved transcriptomics data, where a deep neural network (NN) encoder is used to generate low-dimensional, Gaussian-distributed gene embeddings, which are then combined with spatial relationships using a Gaussian-Multinomial Mixture Model developed by us to predict cell type clustering. The second problem we try to tackle is to call m6A methylated bases in RNA generated from long-read sequencing. m6A modification plays essential roles in regulating gene expression, but an efficient way to detect it systemically is lacking. The long-read sequencing from Oxford Nanopore Technologies has been shown to be sensitive to post-transcriptional modification, but an m6A sensitive basecaller for directly detecting this subtle sequencing signal has not yet been developed. We used a CNN-RNN (Convolutional-Recurrent Neural network) model previously developed by us for canonical basecalling to train a Non-homogeneous HMM (NHMM) where its transition matrix is conditioned on the deep NN output. Using the hybrid synthetically m6A methylation data sampled from the NHMM, we were able to train a NN basecaller to call m6A base. We applied our method to call the methylome in Yeast and Human RNA without the need for knock-out comparison data. For the third application, we developed a deep generative model with a SE(3)-equivariant diffusion transformer to address pocket-based molecular docking, where the 3D structure of the ligand is to be predicted given the protein pocket. We applied our model to a virtual screening task to select effective JAK2 inhibitors, identifying 13 candidate compounds with high affinity scores confirmed by wet lab assays from a total of 9,137 drugs, two of which are new molecules that have never been reported before. 08/26/2024 470 2.4
Journal version of "How much data is sufficient to learn high-performing algorithms?" Journal version of "How much data is sufficient to learn high-performing algorithms?" ===================================================================================== Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1--58 . The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms" has been accepted to the Journal of the ACM. The STOC 2021 version can be found here. https://dl.acm.org/doi/10.1145/3406325.3451036 See also this post. https://kingsfordlab.cbd.cmu.edu/2019-balcan-sampcomplex.html link: DOI https://doi.org/10.1145/367627 link: Preprint https://arxiv.org/abs/1908.02894 06/26/2024 122 0.6
Voices: How has the AI boom impacted algorithmic biology? Voices: How has the AI boom impacted algorithmic biology? ========================================================= Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology?. _Cell Systems_ *15*(6): P483-487. The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm. Many traditional algorithmic tasks, such as genomic sequence alignment or transcript assembly, are implemented as highly parameterized algorithms, where the settings of these parameters can significantly affect the accuracy of the output. These can be hard to set by hand, requiring expertise, time, and a way to assess accuracy. This work can be avoided, while simultaneously increasing reproducibility and accuracy, through new AI approaches that use large datasets to train AI models to predict input-specific parameters for traditional, hand-designed algorithms. Such systems are especially useful when analyzing large, heterogeneous collections of samples where hand selection of optimal parameters is not feasible. Future work in this area involves deeper co-design of parameterized algorithms and AI systems to enable AI-driven optimization, possibly explicitly supporting the selection from among various large-scale algorithmic changes. An additional challenge is to codify the definition of the desired output to be able to optimize parameters for biological insight and utility. This is related to a third challenge, which is avoiding overfitting: when selecting from a large parameter space, trivial solutions that technically optimize the quality of the output but that are not useful can be obtained (for example, if the optimization metric is number of fragments aligned, selecting parameters that simply align all fragments poorly would satisfy the AI but not the biologist). link: DOI https://doi.org/10.1016/j.cels.2024.05.008 06/20/2024 331 1.7
Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants ================================================================================================== Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol.. The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/. link: DOI https://doi.org/10.1186/s13015-024-00262-6 link: Preprint https://arxiv.org/abs/2305.10577 link: Paper https://link.springer.com/article/10.1186/s13015-024-00262-6 link: PDF pdf/2024-qiu_shen-gtedjournal.pdf link: Code https://github.com/Kingsford-Group/gtednewilp/ link: BibTeX bibtex/2024-qiu_shen-gtedjournal.bib 04/28/2024 305 1.5
11% of RECOMB24 papers co-authored by current or former Kingsford group members 11% of RECOMB24 papers co-authored by current or former Kingsford group members =============================================================================== April 24, 2024 Six out of 57 accepted RECOMB24 papers are authored by current or former Kingsford group members. These include 2 papers authored by current Ph.D. students, and 4 authored by former trainees of the group. [Image Omitted] - Improving Hi-C contact matrices using genome graphs. Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang and Carl Kingsford - A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. Shane Elder, Quang Minh Hoang, Mohsen Ferdosi and Carl Kingsford -Inferring allele-specific copy number aberrations and tumor phylogeography using spatially resolved transcriptomics. Cong Ma, Metin Balaban, Clara Liu, Siqi Chen, Li Ding and Ben Raphael - Mapping the topography of spatial gene expression with interpretable deep learning. Uthsav Chitra, Brian Arnold, Hirak Sarkar, Cong Ma, Sereno Lopez-Darwin, Kohei Sanno and Ben Raphael - Meta-colored compacted de Bruijn graphs. Giulio Ermanno Pibiri, Jason Fan and Robert Patro - Accurate Assembly of Circular RNAs with TERRACE. Tasfia Zahin, Qian Shi, Xiaofei Carl Zang and Mingfu Shao Additionally, five papers, including some of the above, have an author currently affiliated with CMU. The acceptance rate was 16.5% 04/24/2024 201 1.0
Yihang Shen successfully defends his dissertation Yihang Shen successfully defends his dissertation ================================================= April 10, 2024 Yihang Shen has successfully defended his Ph.D. Congratulations Dr. Shen! Yihang Shen (2024) Yihang Shen successfully defends his dissertation. Ph.D. Thesis (CMU-CB-24-100). Hyper-parameters play a crucial role in the efficacy of machine learning and computational biology tools. Their optimal selection profoundly impacts tool per- formance, yet manual tuning of these hyper-parameters can be a laborious process, demanding extensive domain knowledge. Therefore, the development of algorithms for automatic hyper-parameter tuning is important. While numerous strategies for hyper-parameter tuning in machine learning exist, certain challenges remain inade- quately addressed. These include tuning hyper-parameters (1) in contexts where the search space exhibits unique characteristics, such as high dimensionality; and (2) for computational biology tools, where optimal settings are closely tied to the specific biological sample being analyzed. In response to these challenges, this dissertation focuses on the development of novel algorithms designed for the automatic tuning of hyper-parameters across various tool types. In the first part, we focus on developing new Bayesian Optimization methods for hyper-parameter tuning in high-dimensional search spaces. Bayesian Optimization is a machine learning method that is widely used in various scenarios, including the tuning of hyper-parameters. Yet, its application in high-dimensional search spaces presents a significant challenge that remains to be fully addressed. To overcome this, we develop a new high-dimensional Bayesian Optimization framework based on the concept of variable selection and show that the new method is more computationally efficient than previous high-dimensional Bayesian Optimization methods. In the second part, we focus on developing hyper-parameter tuning algorithms for transcript assemblers. Transcript assemblers are tools for reconstructing expressed transcripts from the reads in a given RNA-seq sample. Given that these tools have many tunable hyper-parameters, and their optimal configurations greatly depend on the characteristics of the input sample, it is crucial to develop automatic tuning methods that adapt to different inputs. We develop the first adaptive, sample-specific hyper-parameter tuning system for transcript assemblers. This innovation marks an important advancement towards more precise transcript assembly, which in turn will enhance downstream RNA-seq analyses such as transcript quantification. In high-throughput sequencing biological data analysis, the initial step is to align reads to a linear reference genome to determine their genomic locations. Recog- nizing that genetic variations differ among biological samples, it is crucial to use a sample-specific reference genome rather than a default one. Therefore, automati- cally deducing the sample-specific reference genome directly from the sample data becomes an important problem. It is a unique hyper-parameter tuning problem, where the reference genome represents the hyper-parameter and the search space encompasses various potential genomes. In the last part, we focus on developing algorithms to infer genomes from Hi-C, a distinct type of high-throughput sequencing data providing insights into the spatial arrangement of chromosomes. We show that using an inferred genome improves downstream Hi-C analyses, thereby contributing to a more profound understanding of chromosomal organization and function. 04/10/2024 514 2.6
Carl Kingsford named ISCB Fellow Carl Kingsford named ISCB Fellow ================================ March 8, 2024 Carl Kingsford, Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department, has been elected as a Fellow of the International Society of Computational Biology (ISCB). The ISCB notes that Carl has been chosen for this honor because he “is a trailblazer in computational molecular biology, showcasing sustained innovation in scalable algorithmic approaches.” Carl’s research is focused on developing new, efficient algorithms and AI methods for extracting knowledge from large biological data sets, particularly high-throughput DNA and RNA sequencing data. He has worked on algorithms for accurately quantifying gene expression, identifying compact regions of chromatin, and large-scale sequence search. Carl’s group continues to push the boundaries of how computer science can drive scientific discovery, including recently developing new pan-genomic analysis algorithms using genome graphs, using reinforcement learning to optimize experimental protocols to drive automated experimentation, and creating new meta-learning techniques for adapting deep neural networks to new tasks with limited data. He is also director of CMU’s Center for Machine Learning and Health (CMLH). The ISCB Fellows Program, introduced in 2009, is a prestigious recognition within the field of computational biology and honors members that have distinguished themselves through outstanding contributions to the field, provided to only ½ of a percent of the previous year’s ISCB membership. The new Fellows will be introduced during this year’s ISMB conference in July 2024. For more information on this year’s fellows, visit ISCB’s website. https://www.iscb.org/iscb-news-items/5232-march-8-2024-iscb-congratulates-and-introduces-the-2024-class-of-fellows 03/08/2024 275 1.4
k-nonical space: sketching with reverse complements k-nonical space: sketching with reverse complements =================================================== Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. _Bioinformatics_. Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms. [Image Omitted] Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accomodate for these effects: (1) a new procedure that adapts existing sketching methods to k-nonical space and (2) an optimization procedure to directly design new sketching methods for k-nonical space. link: DOI https://doi.org/10.1101/2024.01.25.577301 link: Preprint https://www.biorxiv.org/content/10.1101/2024.01.25.577301v1 link: Code https://github.com/Kingsford-Group/mdsscope 01/27/2024 275 1.4
Adaptive, sample-specific parameter selection for more accurate transcript assembly Adaptive, sample-specific parameter selection for more accurate transcript assembly =================================================================================== Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv. Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because of the heterogeneity of different RNA-seq samples, a single default setting or a small fixed set of parameter candidates can only support the good performance of transcript assembly on average, but are often suboptimal for many individual samples. Manually tuning parameters for each sample is time consuming and requires specialized experience. Therefore, developing an automated system that can advise good parameter settings for individual samples becomes an important problem. Results: Using Bayesian optimization and contrastive learning, we develop a new automated parameter advising system for transcript assembly that can generate sets of sample-specific good parameter candidates. Our framework achieves efficient sample-specific parameter advising by learning parameter knowledge from a representative set of existing RNA-seq samples and transferring the knowledge to unseen samples. We use Scallop and StringTie, two well-known transcript assemblers, to test our framework on two collections of RNA-seq samples. Results show that our new parameter advising system significantly outperforms the previous advising method in each dataset and each transcript assembler. link: DOI https://doi.org/10.1101/2024.01.25.577290 01/27/2024 238 1.2
Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation ====================================================================== Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024. We tackle the problem of meta-learning across heterogenous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity. To overcome this weakness, we propose a new heterogeneous meta-learning strategy that efficiently captures the multi-modality of the task distribution via modulating the routing between convolution channels in the network, instead of directly modulating the network weights. This new mechanism can be cast as a permutation learning problem. We further introduce a novel neural permutation layer based on the classical Benes routing network, which has sub-quadratic parameter complexity in the total number of channels, as compared to the quadratic complexity of the state-of-the-art Gumbel-Sinkhorn layer. We demonstrate our approach on various multi-modal meta-learning benchmarks, showing that our framework outperforms previous methods in both generalization accuracy and convergence speed. link: Paper https://openreview.net/forum?id=QiJuMJl0QS 01/16/2024 266 1.3
A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements =========================================================================================================== C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford. (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024. The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors. Traditional algorithms cope with this complexity by exploring a much larger solution space, leading to exponential blowup in terms of both time and space. To alleviate both issues, we propose a novel alternating optimization algorithm that is able to scale to large, uncertain distance sets with as many as 100,000 points. This algorithm is space and time-efficient, with each step running in $O(m log(m))$ time and requiring only $O( sqrt{m})$ work space for a distance set of size $m$. Evaluations of this approach on synthetic and partial digest data showcase improved accuracy and scalability in the presence of uncertain, duplicated, and missing distances. Our implementation of the algorithm is available here. link: DOI https://doi.org/10.1101/2024.02.15.580520 link: Preprint https://www.biorxiv.org/content/10.1101/2024.02.15.580520v1 link: Code https://github.com/Kingsford-Group/turnpikesolvermm 01/12/2024 308 1.5
Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework ================================================================================================== Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. _Genome Research_, to appear.. Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this dataset to train an end-to-end neural network basecaller followed by fine-tuning on immunoprecipitation-based experimental data with label-smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly. link: DOI https://doi.org/10.1101/2024.01.06.574484 link: Preprint https://biorxiv.org/cgi/content/short/2024.01.06.574484v1 link: Code https://github.com/haotianteng/xron 01/08/2024 231 1.2
Improving Hi-C contact matrices using genome graphs Improving Hi-C contact matrices using genome graphs =================================================== Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024. Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes. [Image Omitted] However, the linear reference genome does not incorporate genetic variation information, which can lead to incorrect read alignments, especially when analyzing samples with substantial genomic differences from the reference such as cancer samples. Using genome graphs as the reference facilitates more accurate mapping of reads, however, new algorithms are required for inferring linear genomes from Hi-C reads mapped on genome graphs and constructing corresponding Hi-C contact matrices, which is a prerequisite for the subsequent steps of the Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show the hardness of solving this problem, and introduce a novel heuristic algorithm specifically tailored to this problem. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred from our method lead to the creation of improved Hi-C contact matrices. These enhanced matrices show a reduction in erroneous patterns caused by structural variations and are more effective in accurately capturing the structures of topologically associated domains. link: Preprint https://doi.org/10.1101/2023.11.08.566275 link: PDF pdf/2024-shen-graphhic.pdf link: Code https://github.com/Kingsford-Group/graphhic 11/13/2023 297 1.5
Sketching methods with small window guarantee using minimum decycling sets Sketching methods with small window guarantee using minimum decycling sets ========================================================================== Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology. Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. In particular the window guarantee ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee corresponds to a Decycling Set, aka an unavoidable sets of k-mers. Any long enough sequence must contain a k-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set defines a sketching method by selecting the k-mers from the set. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their small size. Only two algorithms, by Mykkeltveit and Champarnaud, are known to generate two particular MDSs, although there is a vast number of alternative MDSs. We provide a simple method that allows one to explore the space of MDSs and to find sets optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. link: DOI https://doi.org/10.48550/arXiv.2311.03592 link: Preprint https://arxiv.org/abs/2311.03592 link: Paper https://doi.org/10.1089/cmb.2024.0544 11/09/2023 342 1.7
Minh Hoang successfully defends his dissertation Minh Hoang successfully defends his dissertation ================================================ [Image Omitted] October 16, 2023 Minh Hoang has successfully defended his Ph.D. dissertation, which is titled “Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology.” He will join Princeton University as a postdoc. Congratulations Dr. Hoang! Minh Hoang (2023) Minh Hoang successfully defends his dissertation. Ph.D. Thesis. Configuration tuning is an essential practice to achieve good performance with many computational methods. However, configuring complex and discrete algorithms often requires significant trial-and-error effort due to a lack of automated solutions. In large-scale systems where computational tasks are numerous and constantly changing in specificity, the repetitive cost of manual tuning becomes a major bottleneck that hinders scalability. Moreover, the absence of a systematic approach to configure deployment settings makes it challenging to replicate the obtained results in different deploying conditions. To address these problems, this thesis focuses on developing new data-driven automated algorithm design (AAD) frameworks in several classical and multi-task settings. Specifically, in the classical configuration tuning setting, we address the problems of kernel selection for Bayesian methods, and minimizer construction for biological sequence sketching. In the multi-task scenario, we address the problems of privacy-preserving neural architecture search for multiple clients, and meta-learning for parameter optimization in a heterogeneous task stream. In all of these problems, the variables to be optimized often have underlying discrete structures such as trees, graphs or permutations. Our contribution is a suite of reformulation techniques that result in efficient and accurate tuning methods for these configuration domains. Finally, we demonstrate the performance of our methods on practical scenarios and show that they have significantly outperformed state-of-the-art benchmarks. link: Paper http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/home/anon/home/ftp/cbd/abstracts/20-101.html link: PDF pdf/2023-news-minhdefends.pdf 10/16/2023 311 1.6
Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC ============================================================================================= Hossein Asghari, Ehsan Haghshenas, Roby Thomas, Eric Schultz, Rob Patro, Stan Skrzypczak, and Carl Kingsford (2023) Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC. Cancer Res (2023) 83 (7_Supplement): 1400. link: DOI https://doi.org/10.1158/1538-7445.AM2023-1400 link: Paper https://doi.org/10.1158/1538-7445.AM2023-1400 06/21/2023 70 0.3
Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer ======================================================================================================================================================================= AC Chiang, H Asghari, K Ashley, S Gettinger, S Goldberg, R Herbst, FH Wilson, BR Newton, MK Cohenuram, KD Sabbath, AS Talsania, AV Russo, E Schultz, S Skrzypczak, C Kingsford, and KS Schalper (2023) Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer. J Clin Oncol 41, 2023 (suppl 16; abstr 8597). link: DOI https://doi.org/10.1200/JCO.2023.41.16_suppl.8597 link: Paper https://doi.org/10.1200/JCO.2023.41.16_suppl.8597 06/21/2023 111 0.6
Reinforcement Learning for Robotic Liquid Handler Planning Reinforcement Learning for Robotic Liquid Handler Planning ========================================================== Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford (2023) Reinforcement Learning for Robotic Liquid Handler Planning. In Proceedings of WABI 2023. Robotic liquid handlers play a crucial role in automating laboratory tasks such as sample preparation, high-throughput screening, and assay development. Manually designing protocols takes significant effort, and can result in inefficient protocols and involve human error. We investigate the application of reinforcement learning to automate the protocol design process resulting in reduced human labor and further automation in liquid handling. We develop a reinforcement learning agent that can automatically output the step-by-step protocol based on the initial state of the deck, reagent types and volumes, and the desired state of the reagents after the protocol is finished. We show that finding the optimal protocol for solving a liquid handler instance is NP-complete, and we present a reinforcement learning algorithm that can solve the planning problem practically for cases with a deck of up to 20 × 20 wells and four different types of reagents. We design and implement an actor-critic approach, and we train our agent using the Impala algorithm. Our findings demonstrate that reinforcement learning can be used to automatically program liquid handler robotic arms, enabling more precise and efficient planning for the liquid handler and laboratory automation. link: Paper https://doi.org/10.4230/LIPIcs.WABI.2023.23 link: PDF pdf/2023-ferdosi-liquidhandler.pdf 06/21/2023 238 1.2
Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme ========================================================================================== Minh Hoang, Guillaume Marçais, and Carl Kingsford (2024) Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. Journal of Computational Biology 31(1):2-20. Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to be simultaneously achieved. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used. link: DOI https://doi.org/10.1089/cmb.2023.0212 link: Preprint https://doi.org/10.1101/2022.10.18.512430 link: Code https://github.com/Kingsford-Group/maskedminimizer 06/21/2023 312 1.6
Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection ======================================================================================= Yihang Shen and Carl Kingsford (2023) Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection. In Proceedings of AutoML 2023. Bayesian Optimization (BO) is a method for globally optimizing black-box func- tions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming. Alternative strategies for high-dimensional BO that are based on the idea of embedding the high-dimensional space to one with low dimensions are sensitive to the choice of the embedding dimension, which needs to be pre-specified. We develop a new computationally efficient high-dimensional BO method that exploits variable selection. Our method is able to automatically learn axis-aligned sub-spaces, i.e. spaces containing selected variables, without the demand of any pre-specified hyperparameters. We analyze the computational complexity of our algorithm. We empirically show the efficacy of our method on several synthetic and real problems. link: PDF pdf/2023-shen-vsbo.pdf 06/21/2023 183 0.9
Creating and Using Minimizer Sketches in Computational Genomics Creating and Using Minimizer Sketches in Computational Genomics =============================================================== Hongyu Zheng, Guillaume Marçais, and Carl Kingsford (2023) Creating and Using Minimizer Sketches in Computational Genomics. J Comp. Biol. (2023). Processing large datasets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared to the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future. link: PDF pdf/2023-zheng-minreview.pdf 06/21/2023 188 0.9
Carl chairs and CMU hosts an NSF-NIH Joint Workshop on Emerging AI in Biology Carl chairs and CMU hosts an NSF-NIH Joint Workshop on Emerging AI in Biology ============================================================================= June 10, 2023 In order to survey the current frontier of the interface between AI methodology and biology and to chart future directions and challenges, we held an "NSF-NIH Joint Workshop on Emerging AI in Biology" in June 2023 that gathered approximately 40 experts on the intersection of research in AI and biology. New techniques in artificial intelligence (AI) are rapidly being developed, extended and applied to challenging problems in biology. At the same time, as new assays, new data collection efforts, and greater understanding are developed in biology, the scope of problems that are amendable to AI approaches is growing. The workshop included scientific presentations about cutting-edge applications and computational methodology relating to the use of AI in biological sciences, many of which were subsequently publicly posted to the web. The workshop also included significant discussion time, during which participants collaborated on drafting a report to the NIH and NSF that communicated some of the opportunities and challenges in applying AI in biology. These insights were shaped into a final report that was delivered to the NSF and NIH. https://www.cs.cmu.edu/cih/events/cih-event-ai-biology-2023 Videos of some of the talks are available here. https://www.youtube.com/playlist?list=PLzjJEuCotCQxpyY4gna09l1rPiramjNrc 06/10/2023 225 1.1
Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants ================================================================================================== Yutong Qiu, Yihang Shen, and Carl Kingsford (2023) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. In Proceedings of WABI 2023. The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error- prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs will always yield optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics that estimate GTED efficiently. link: DOI 10.4230/LIPIcs.WABI.2023.11 link: Preprint https://arxiv.org/abs/2305.10577 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2023.11 link: PDF pdf/2023-qui_shen-gted.pdf link: Code https://github.com/Kingsford-Group/gtednewilp/ link: BibTeX bibtex/2023-qiu_shen-gted.bib 05/19/2023 290 1.4
Yutong Qiu successfully defends her dissertation Yutong Qiu successfully defends her dissertation ================================================ May 5, 2023 Yutong Qiu, a CPCB Ph.D. student, successfully defended her dissertation titled “Algorithmic Foundations of Genome Graph Construction and Comparison.” She will join Illumina, Inc. Congratulations, Dr. Qiu! Yutong Qiu (2023) Yutong Qiu successfully defends her dissertation. Ph.D. Thesis CMU-CB-23-101. Pangenomic studies have enabled a more accurate depiction of the human genome landscape. Genome graphs are suitable data structures for analyzing collections of genomes due to their efficiency and flexibility of encoding shared and unique substrings from the population of encoded genomes. Novel challenges arise when genome graphs are applied to thousands of genomes because current genome graph models are insufficient in addressing the questions: (1) How can genome graphs be constructed efficiently that optimize the storage space? (2) How can genome graphs be used to more accurately and more efficiently compare heterogeneous sequences such as cancer genomes or immune repertoires? To answer these questions, we lay algorithmic foundations for genome graph construction and comparison. The size of a genome graph is crucial to both efficient storage and analysis. How- ever, few genome graph construction methods directly optimize the graph size. By drawing connections to data compression, we develop an algorithmic framework for genome graph construction that prioritizes genome graph size and show that the new framework produces small genome graphs efficiently compared to other genome graph schemes. Our compression-based framework not only removes the depen- dency on hyper-parameters but also opens up the potential for adapting established compression algorithms to construct better genome graphs. In many scenarios, such as immune repertoire analysis, we need to quantify the similarity between heterogeneous sets of genomic strings, but the complete strings are unknown due to limitations in sequencing technology. The distance between genome graphs can be used to estimate to the difference between these strings. One important metric is defined as the graph traversal edit distance (GTED). We revisit the complexity of and the previously proposed algorithms for GTED. We prove that GTED is NP-complete and show that the previously proposed algorithms computes a lower bound of GTED. In addition, we propose two correct ILP formulations of GTED and characterize the relationship between GTED and the previous lower bound ILPs. We evaluate the empirical efficiency of solving GTED and its lower bound ILP and show that solving GTED exactly with ILPs is currently not practical on larger genomes. Genome graphs are often highly expressive and represent more than one string sets, and thus the distance between two graphs using standard graph distances does not always model the actual edit distance between true string sets. To quantify this discrepancy, we formally define genome graph expressiveness as its diameter and use it to bound the deviation of the genome graph distance from string set distances. We produce a more accurate distance measure between (unseen) collections of strings encoded as genome graphs. The new distance measure and its deviation from string set distances are evaluated on simulated human T-cell repertoire sequences and Hepatitis B virus genomes. link: PDF pdf/2023-news-yutongdefends.pdf 05/05/2023 515 2.6
Laura Tung successfully defends her Ph.D. thesis Laura Tung successfully defends her Ph.D. thesis ================================================ October 28, 2022 Laura Tung successfully defended her Ph.D. thesis. She will join Gaurdant Health, Inc. Congratulations, Dr. Tung! [Image Omitted] Laura H. Tung (2022) Laura Tung successfully defends her Ph.D. thesis. Ph.D. (CMU-CB-22-108). Studying the transcriptome is crucial to understanding functional elements of the genome and elucidating biological pathways associated with disease. High- throughput sequencing such as RNA-seq has become a powerful tool for transcrip- tome analysis. Due to limited read lengths, identifying full-length transcripts from short reads remains challenging. As third-generation sequencing becomes increas- ingly important, single-molecule long reads have been used to improve transcrip- tome analyses such as isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis. This drives a need for transcript assembly on long reads. We developed a reference-based long-read transcript assembler, Scallop-LR, aiming to discover more novel isoforms. Analyzing a considerable number of RNA-seq long-read samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate that Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read tran- script assembly by Scallop-LR can reveal a more complete human transcriptome. Nanopore sequencing has become a leading choice for long-read RNA-seq. How- ever, Nanopore long reads have high error rates. For many non-model organisms without a high-quality reference, de novo (reference-free) error correction methods designed for RNA-seq long reads are needed. We developed a novel, error-profile-aware correction method, deepCorrRNA, for correcting RNA-seq long reads de novo using deep learning. deepCorrRNA combines a graph-based method and a deep neural network that incorporates the error profile related information systematically. We show that ML-based deepCorrRNA achieves comparable error-rate reductions to state-of-the-art ONT-specific isONcorrect. Across different organisms, deepCor-rRNA demonstrates robust de novo error correction capability, which can benefit the transcriptome studies of non-model organisms. deepCorrRNA’s method in principle is generalizable and may be applied to different technologies. To accelerate transcriptome analyses, RNA-seq analysis tools require comprehensive evaluation and parameter optimization. While the number of RNA-seq samples grows enormously at large sequence databases, most RNA-seq analysis tools are evaluated on limited RNA-seq samples. This leads to a need to select a representative subset from RNA- seq samples at large databases, which effectively summarizes the original collection of RNA-seq samples. We developed a novel hierarchical representative set selection method, to tackle the memory and runtime challenges in k-mer counting approaches for RNA-seq samples in a large database. We demonstrate that hierarchical represen- tative set selection achieves summarization quality close to direct representative set selection, while largely reducing the runtime and memory usage, and substantially outperforms random sampling on the entire SRA set of human RNA-seq samples. The algorithms, methods, and analysis we have developed can be used to improve transcriptome analyses and further our understanding of complex transcriptomes. link: PDF pdf/2022-news-tungphd.pdf 10/28/2022 529 2.6
The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance ================================================================================================================== Yutong Qiu and Carl Kingsford (2022) The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. In Proceedings of ISMB 2022 in Bioinformatics 38(Supplement_1):i404-412 (2022).. Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. link: DOI 10.1093/bioinformatics/btac264 link: Paper https://doi.org/10.1093/bioinformatics/btac264 link: BibTeX bibtex/2022-qiu-expressive.bib 08/24/2022 333 1.7
Hongyu Zheng successfully defends his Ph.D. dissertation Hongyu Zheng successfully defends his Ph.D. dissertation ======================================================== February 24, 2022 Hongyu Zheng successfully defended his Ph.D. thesis. He will joint Princeton University as a postdoc. Congradulations, Dr. Zheng! Hongyu Zheng (2022) Hongyu Zheng successfully defends his Ph.D. dissertation. Ph.D. Dissertation. Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches. This favorable property of minimizers, and its ease of implementation, lead to wide adoption in a large number of software and pipelines, including read mappers, genome assemblers, sequence databases and more. Despite the method’s popularity, many theoretical and practical questions regarding its performance remain open. This is especially true regarding the density of minimizers, a simple yet powerful metric for minimizer performance. We have neither understanding of density growth under various conditions, nor principled approaches to construct minimizers with provably low density. This dissertation attacks the lack of knowledge in minimizer sketches from multiple fronts. In the first part, we are primarily concerned with asymptotic behavior for the density of minimizer sketches. Minimizers are parameterized by the k-mer length k, a window length w and an order on the k-mers. Using a number of new techniques, we are able to provide a complete picture of how the density of optimal minimizer grows with its parameters w and k. We also derive structural lemmas for universal hitting sets and local schemes, two highly related concepts for minimizer design. Together, these results serve as building blocks for future theoretical advances. The next two parts are focused on constructing low-density minimizers in two scenarios. In the second part, we propose the Miniception, the first construction of minimizers that provably achieves better densities than a random minimizer in practical configurations of w and k. This method is also simple to implement and requires minimal overhead compared to existing implementations of random mini- mizers. In the third and final part, we consider optimization of minimizer density on a given reference sequence. The most common use case is when the given se- quence is the reference genome or transcriptome. We propose the concept of polar sets, which itself is a complementary concept of universal hitting sets and similarly can be used to construct minimizers. Using polar sets, we propose efficient algo- rithms to directly optimize the density of minimizers on a reference sequence up to some error. Experiments show that both the Miniception and the polar set algorithms can reliably outperform existing methods for constructing low-density minimizers, without a reference and with a reference, respectively. link: PDF pdf/2022-news-zhengphd.pdf 02/24/2022 467 2.3
DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes ============================================================================================ Minh Hoang, Hongyu Zheng, and Carl Kingsford (2022) DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes. In Proceedings of RECOMB 2022.. Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite its widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well on expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of $k$-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network reparameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, extsc{DeepMinimizer}, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences. link: Code https://github.com/Kingsford-Group/deepminimizer 12/21/2021 194 1.0
Personalized neural architecture search for federated learning. Personalized neural architecture search for federated learning. =============================================================== Minh Hoang and Carl Kingsford (2021) Personalized neural architecture search for federated learning.. In NeurIPS 2021 Workshop on New Frontiers in Federated Learning 2021.. Federated Learning (FL) is a recently proposed learning paradigm for decentralized devices to collaboratively train a predictive model without exchanging private data. Existing FL frameworks, however, assume a one-size-fit-all model architecture to be collectively trained by local devices, which is determined prior to observing their data. Even with good engineering acumen, this often falls apart when local tasks are different and require diverging choices of architecture modelling to learn effectively. This motivates us to develop a novel personalized neural architecture search (NAS) algorithm for FL, which learns a base architecture that can be structurally personalized for quick adaptation to each local task. On several real- world datasets, our algorithm, FEDPNAS is able to achieve superior performance compared to other benchmarks on heterogeneous multitask scenarios. link: Paper https://neurips2021workshopfl.github.io/NFFL-2021/papers/2021/Hoang2021.pdf 08/24/2021 170 0.8
How much data is sufficient to learn high-performing algorithms? How much data is sufficient to learn high-performing algorithms? ================================================================ Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2019) How much data is sufficient to learn high-performing algorithms?. STOC 2021 (preprint arXiv:1908.02894 [cs.LG] (2019)). Algorithms for scientific analysis typically have tunable parameters that significantly influence computational efficiency and solution quality. If a parameter setting leads to strong algorithmic performance on average over a set of typical problem instances, that parameter setting---ideally---will perform well in the future. However, if the set of typical problem instances is small, average performance will not generalize to future performance. This raises the question: how large should this set be? We answer this question for any algorithm satisfying an easy-to-describe, ubiquitous property: its performance is a piecewise-structured function of its parameters. We are the first to provide a unified sample complexity framework for algorithm parameter configuration; prior research followed case-by-case analyses. We present applications from diverse domains, including biology, political science, and economics. link: Preprint https://arxiv.org/abs/1908.02894 link: BibTeX bibtex/2019-balcan-sampcomplex.bib 08/09/2021 189 0.9
Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701) Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701) =============================================================================================================================================================================================================================================== Toshiki Masuishi, Hiroya Taniguchi, Daisuke Kotani, Hideaki Bando, Taroh Satoh, Taito Esaki, Yoshito Komatsu, Yu Sunakawa, Tomohiro Nishina, Eiji Shinozaki, Naohiro Nishida, Masato Komoda, Satoshi Yuki, Naoki Izawa, Gaurav Sharma, Stan Skrzypczak, Eric Schultz, Carl Kingsford, Akihiro Sato, and Takayuki Yoshino (2021) Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701). Journal of Clinical Oncology 39(15 suppl): e15532 (2021). 06/21/2021 121 0.6
Constructing small genome graphs via string compression Constructing small genome graphs via string compression ======================================================= Yutong Qiu and Carl Kingsford (2021) Constructing small genome graphs via string compression. ISMB 2021. The size of a genome graph -- the space required to store the nodes, their labels and edges -- affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs. We point out similarities in the string encoding approaches of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed based on an optimal EPM compression. In addition to the transformation, we show that equivalent choices made by EPM compression algorithms may result in different sizes of genome graphs. To further optimize the size of the genome graph, we purpose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv EPM compression algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph link: DOI https://doi.org/10.1101/2021.02.08.430279 link: Preprint https://www.biorxiv.org/content/10.1101/2021.02.08.430279v1 link: Code https://github.com/Kingsford-Group/rlzgraph 02/10/2021 344 1.7
Practical selection of representative sets of RNA-seq samples using a hierarchical approach Practical selection of representative sets of RNA-seq samples using a hierarchical approach =========================================================================================== Laura H. Tung and Carl Kingsford (2021) Practical selection of representative sets of RNA-seq samples using a hierarchical approach. ISMB 2021. Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the huge number of available RNA-seq samples and the large number of k-mers/sequences in each sample, computing the full similarity matrix between all samples using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges, making direct representative set selection infeasible with limited computing resources. Therefore, we developed a novel computational method called "hierarchical representative set selection" to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks the representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve performance close to that of direct representative set selection, while largely reducing the runtime and memory requirements of computing the full similarity matrix (up to 8.4X runtime reduction and 4.7X memory reduction for 10000 samples that could be practically run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases such as the SRA. link: DOI https://doi.org/10.1101/2021.02.04.429817 link: Preprint https://biorxiv.org/cgi/content/short/2021.02.04.429817v1 link: Code https://github.com/Kingsford-Group/hierrepsetselection 02/06/2021 342 1.7
Yutong Qiu awarded SCS Cancer Research fellowship Yutong Qiu awarded SCS Cancer Research fellowship ================================================= February 5, 2021 CPCB Ph.D. student Yutong Qiu has been awarded an SCS Cancer Research Fellowship. Yutong works on computational methods for understanding the human genome, including methods to identify variants within single genomes and populations of genomes. Her recent work is focused on construction and use of genome graphs for applications in cancer, especially more accurate subtyping of cancers from genomic features. [Image Omitted] Congratulations! More details here http://cbd.cmu.edu/news/2020/scs-cancer-research-fellowships-awarded-to-yutong-qiu-and-trevor-frisby.html 02/05/2021 95 0.5
Carl Kingsford awarded the Herbert A. Simon Professorship in Computer Science Carl Kingsford awarded the Herbert A. Simon Professorship in Computer Science ============================================================================= February 4, 2021 Carl will receive the Herbert A. Simon Professorship of Computer Science in a virtual ceremony at 5:30 p.m. on Thursday, Feb. 4, 2021. For more information see here. https://www.cs.cmu.edu/news/2021/scs-celebrates-simon-alumni-research-professorships 02/04/2021 57 0.3
Sequence-specific minimizers via polar sets Sequence-specific minimizers via polar sets =========================================== Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2021) Sequence-specific minimizers via polar sets. ISMB 2021. Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. link: DOI https://doi.org/10.1101/2021.02.01.429246 link: Preprint https://www.biorxiv.org/content/10.1101/2021.02.01.429246v1 02/03/2021 273 1.4
Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism ==================================================================================================================================== Avi Srivastava, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Carl Kingsford, and Rob Patro (2021) Accounting for fragments of unexpected origin improves transcript quantification in RNA-seq simulations focused on increased realism. bioRxiv, doi: https://doi.org/10.1101/2021.01.17.426996. Transcript and gene quantification is the first step in many RNA-seq analyses. While many factors and properties of experimental RNA-seq data likely contribute to differences in accuracy between various approaches to quantification, it has been demonstrated (1) that quantification accuracy generally benefits from considering, during alignment, potential genomic origins for sequenced fragments that reside outside of the annotated transcriptome. Recently, Varabyou et al. (2) demonstrated that the presence of transcriptional noise leads to systematic errors in the ability of tools — particularly annotation-based ones — to accurately estimate transcript expression. Here, we confirm the findings of Varabyou et al. (2) using the simulation framework they have provided. Using the same data, we also examine the methodology of Srivastava et al.(1) as implemented in recent versions of salmon (3), and show that it substantially enhances the accuracy of annotation-based transcript quantification in these data. link: Preprint https://www.biorxiv.org/content/10.1101/2021.01.17.426996v1.full 01/27/2021 219 1.1
Capturing the complexity of topologically associating domains through multi-feature optimization Capturing the complexity of topologically associating domains through multi-feature optimization ================================================================================================ Natalie Sauerwald and Carl Kingsford (2021) Capturing the complexity of topologically associating domains through multi-feature optimization. bioRxiv. The three-dimensional structure of human chromosomes is tied to gene regulation and replication timing, but there is still a lack of consensus on the computational and biological definitions for chromosomal substructures such as topologically associating domains (TADs). TADs are described and identified by various computational properties leading to different TAD sets with varying compatibility with biological properties such as boundary occupancy of structural proteins. We unify many of these computational and biological targets into one algorithmic framework that jointly maximizes several computational TAD definitions and optimizes TAD selection for a quantifiable biological property. Using this framework, we explore the variability of TAD sets optimized for six different desirable properties of TAD sets: high occupancy of CTCF, RAD21, and H3K36me3 at boundaries, reproducibility between replicates, high intra- vs inter-TAD difference in contact frequencies, and many CTCF binding sites at boundaries. The compatibility of these biological targets varies by cell type, and our results suggest that these properties are better reflected as subpopulations or families of TADs rather than a singular TAD set fitting all TAD definitions and properties. We explore the properties that produce similar TAD sets (reproducibility and inter- vs intra-TAD difference, for example) and those that lead to very different TADs (such as CTCF binding sites and inter- vs intra-TAD contact frequency difference). link: Preprint https://www.biorxiv.org/content/10.1101/2021.01.04.425264v1 link: Code https://github.com/Kingsford-Group/frankentad 01/06/2021 268 1.3
Cong Ma successfully defends her Ph.D. thesis. Cong Ma successfully defends her Ph.D. thesis. ============================================== October 12, 2020 Cong Ma has successfully defended her Ph.D. Congratulations Dr. Ma! Cong will join Princeton University as a postdoc. Cong Ma (2020) Cong Ma successfully defends her Ph.D. thesis.. Ph.D. Thesis. Anomalies are data points that do not follow established or expected patterns. When measuring gene expression, anomalies in RNA-seq are observations or pat- terns that cannot be explained by the inferred transcript sequences or expressions. Transcript sequences and expression are key indicators for cell status and are used in many phenotypic and disease analyses. Identifying such unexplainable RNA-seq patterns can inspire improvements in the accuracy of inferred transcript sequences and expression of RNA-seq data and benefit the analyses based on transcripts. We develop computational methods to identify the RNA-seq anomalies that violate in- ferred sequence variation and expression patterns, and to improve the reconstructed transcripts such that they can explain the anomalies. The first type of anomaly that we detect is the large-scale sequence variation in transcriptome, or transcriptomic structural variants (TSVs). TSVs are usually in- duced by genomic structural variants, which can fuse sequences either from a pair of genes or involve intergenic regions. Previous TSV detection methods assume that TSVs only fuse a pair genes and do not consider that some genes are still un- known, thus many RNA-seq reads from the intergenic or intronic regions cannot be explained by gene fusions. We develop a computational method, SQUID, to identify fusions both between a pair of genes and involving non-transcribing regions, thus enlarging the set of explained variants and RNA-seq reads. SQUID is further ex- tended to the MULTIPLE COMPATIBLE ARRANGEMENTS PROBLEM, which is able to detect TSVs in the allele heterogeneity context. The second type of anomaly that we identify are coverage anomalies in estimated expression. The number of RNA- seq reads at each position along each transcript follows a distribution determined by the RNA-seq experiment protocol. We develop a method, Salmon Anomaly Detec- tion (SAD), to identify the transcripts with an unexplainable coverage distribution by RNA-seq protocol. We observe that both quantification algorithm mistakes and incomplete reference transcripts cause abnormal coverage patterns. We also develop an adjustment procedure to correct quantification algorithm mistakes indicated by coverage anomalies and improve the accuracy of estimated expression. Our analysis of the coverage anomalies shows that some of the coverage anomalies are indica- tors of the regulation efficiency of transcription factors and can explain a part of the variability of the target gene expression. The developed methods introduce novel dimensions to more completely explain RNA-seq data, and can be incorporated into RNA-seq analyses to better characterize phenotype-transcript relationships or used to evaluate future transcript reconstruction methods. link: PDF pdf/2020-news-maphd.pdf 10/12/2020 470 2.4
Probabilistic method corrects previously uncharacterized Hi-C artifact Probabilistic method corrects previously uncharacterized Hi-C artifact ====================================================================== Yihang Shen and Carl Kingsford (2020) Probabilistic method corrects previously uncharacterized Hi-C artifact. bioRxiv. Three-dimensional chromosomal structure plays an important role in gene regulation. Chromosome conformation capture techniques, especially the high-throughput, sequencing-based technique Hi-C, provide new insights on spatial architectures of chromosomes. However, Hi-C data contains artifacts and systemic biases that substantially influence subsequent analysis. Computational models have been developed to address these biases explicitly, however, it is difficult to enumerate and eliminate all the biases in models. Other models are designed to correct biases implicitly, but they will also be invalid in some situations such as copy number variations. New methods are required for better bias correction. We characterize a new kind of artifact in Hi-C data. We find that this artifact is caused by incorrect alignment of Hi-C reads against approximate repeat regions and can lead to erroneous chromatin contact signals. The artifact cannot be corrected by current Hi-C correction methods. We design a probabilistic method and develop a new Hi-C processing pipeline by integrating our probabilistic method with the HiC-Pro pipeline. We find that the new pipeline can remove this new artifact effectively, while preserving important features of the original Hi-C matrices. link: Preprint https://www.biorxiv.org/content/10.1101/2020.10.07.325332v1 10/08/2020 228 1.1
Natalie Sauerwald successfully defends her Ph.D. thesis. Natalie Sauerwald successfully defends her Ph.D. thesis. ======================================================== Natalie Sauerwald (2020) Natalie Sauerwald successfully defends her Ph.D. thesis.. Ph.D. Dissertation. October 5, 2020 The last two decades have introduced several experimental methods for studying three-dimensional chromosome structure, opening up a new dimension of genomics. Studies of these new data types have shown great promise in explaining some of the open questions in gene regulation, but the experiments are indirect and imperfect measurements of the underlying structure, requiring rigorous computational methods. We can now study the 3D relationships between all pairs of chromosome segments across the genome, but questions such as the variability of this structure between cell and tissue types, the predictors of structural similarity, the dynamics of this complex system, and a complete definition of the observed substructures remain unclear. This dissertation presents several approaches to improve our understanding of human genomic spatial architecture. We present a new method to quantify the variability of chromosomal substructures, called topologically-associating domains (TADs) between any pair of samples. This algorithm efficiently identifies all regions with statistically significantly similar TAD structures between the two samples. Us- ing this method, we quantify the structural similarity within each chromosome and between chromosomes, and between cell types. We show that cancer cell lines are structurally disrupted at pan-cancer genes, but not globally. We perform extensive data analysis using this method and others to assess the consistency of TADs across a range of biological and technical conditions. This large scale study of chromosomal structural variability emphasizes the differences between chromosome structures be- tween cell and tissue types, in contrast to the belief that genome structure is highly conserved. We quantify the influence of genetic difference and similarity, as well as technical confounders, on chromosome structural similarity in a systematic study of over 100 samples. We also apply a biophysics model to predict the dynamics of chromosomes from static data. Our predictions correlate well with several different experimental measures and known substructures. We predict the existence of long range dynamic couplings involved in gene regulation that have not been found with- out a dynamic model. Finally, we develop a generalized TAD-finding algorithm that can be guided towards selecting TADs for any desired property. Defining several functions around common evaluation criteria for TADs, we explore the relationships between various biological TAD properties and the computational definitions used to identify TADs. The algorithms and analysis we have developed enable rigorous study of the basic properties of this new dimension of genomics, and can continue to inform the study of TADs as more experimental data becomes available. link: PDF pdf/sauerwald-2020-phd.pdf 10/05/2020 439 2.2
Exact Transcript Quantification Over Splice Graphs Exact Transcript Quantification Over Splice Graphs ================================================== Cong Ma, Hongyu Zheng, and Carl Kingsford (2020) Exact Transcript Quantification Over Splice Graphs. In Proceedings of WABI 2020; journal version in Alg. Mol. Bio. 2021. The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-art transcript quantification method. This new approach is useful for modeling scenarios where reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis. The source code is available at https://github.com/Kingsford-Group/subgraphquant. link: DOI 10.4230/LIPIcs.WABI.2020.12 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2020.12 08/25/2020 260 1.3
Novel target discovery in pembrolizumab-resistant gastric cancer using a comprehensive RNA-seq analysis pipeline Novel target discovery in pembrolizumab-resistant gastric cancer using a comprehensive RNA-seq analysis pipeline ================================================================================================================ J Lee, ST Kim, KM Kim, E Schultz, S Skrzypczak, R Patro, and C Kingsford (2020) Novel target discovery in pembrolizumab-resistant gastric cancer using a comprehensive RNA-seq analysis pipeline. Journal of Clinical Oncology 38(15_suppl), e16541-e16541 (2020). 06/21/2020 55 0.3
Optimizing Dynamic Structures with Bayesian Generative Search Optimizing Dynamic Structures with Bayesian Generative Search ============================================================= Minh Hoang and Carl Kingsford (2020) Optimizing Dynamic Structures with Bayesian Generative Search. ICML 2020. Kernel selection for kernel-based model is a difficult problem. Exact solutions are, in general, prohibitively expensive to solve for due to the NP-hard nature of these combinatorial tasks. In addition, gradient-based optimization techniques are inapplicable due to the non-differentiability of the objective function. As such, many state-of-the-art solutions for this problem resorts to heuristic search and gradient-free optimization. These approaches, however, require imposing restrictive assumptions on the explorable space of candidate structures such as randomized candidate pool and/or upper-bounded structure dimension, thus depending heavily on the intuition of domain experts. This paper instead proposes DTERGENS, a novel generative search framework that constructs and optimizes a recursive structure generation routine for high-performance composite kernel expressions. DTERGENS does not require restricting the space of candidate kernels and is capable of exploring arbitrary-dimensional kernel structures by jointly optimizing a stopping criterion for the kernel structure generator. We demonstrate that our framework is able to explore a more diverse range of kernel structures and obtain better results than state-of-the-art approaches on many real-world predictive tasks. 06/01/2020 209 1.0
Harvestman: A framework for hierarchical feature learning and selection from whole genome sequencing data. Harvestman: A framework for hierarchical feature learning and selection from whole genome sequencing data. ========================================================================================================== Trevor Frisby, Shawn James Baker, Guillaume Marçais, Quang Minh Hoang, Carl Kingsford, and Christopher James Langmead (2020) Harvestman: A framework for hierarchical feature learning and selection from whole genome sequencing data.. BMC Bioinformatics 22:174 (2021). We present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, the largest publicly available collection of whole genome sequences. Next, using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. Finally, we compare Harvestman to existing feature selection methods and demonstrate that our method selects smaller and less redundant feature subsets, while maintaining accuracy of the resulting classifier. The data used is available through either the 1000 Genomes Project or The Cancer Genome Atlas. Access to TCGA data requires the completion of a Data Access Request through the Database of Genotypes and Phenotypes (dbGaP). Binary releases of Harvestman compatible with Linux, Windows, and Mac are available for download at https://github.com/cmlh-gp/Harvestman-public/releases link: DOI https://doi.org/10.1186/s12859-021-04096-6 link: Preprint https://www.biorxiv.org/content/10.1101/2020.03.24.005603v1 link: Paper https://rdcu.be/chVE2 link: Code https://github.com/cmlh-gp/Harvestman-public/releases 03/25/2020 280 1.4
Improved design and analysis of practical minimizers Improved design and analysis of practical minimizers ==================================================== Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Improved design and analysis of practical minimizers. Proceedings of ISMB 2020 (Bioinformatics). Minimizers are methods to sample k-mers from a sequence, with the guarantee that similar set of k-mers will be chosen on similar sequences. It is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of softwares and pipelines to improve computation efficiency and decrease memory usage. Despite the method's popularity, many theoretical questions regarding its performance remain open. The core metric for measuring performance of a minimizer is the density, which measures the sparsity of sampled k-mers. The theoretical optimal density for a minimizer is 1/w, provably not achievable in general. For given k and w, little is known about asymptotically optimal minimizers, that is minimizers with density O(1/w). We derive a necessary and sufficient condition for existence of asymptotically optimal minimizers. We also provide a randomized algorithm, called the Miniception, to design minimizers with the best theoretical guarantee to date on density in practical scenarios. Constructing and using the Miniception is as easy as constructing and using a random minimizer, which allows the design of efficient minimizers that scale to the values of k and w used in current bioinformatics software programs. link: DOI https://doi.org/10.1101/2020.02.07.939025 link: Preprint https://www.biorxiv.org/content/10.1101/2020.02.07.939025v2.full.pdf link: Code https://github.com/Kingsford-Group/miniception 02/10/2020 268 1.3
Lower density selection schemes via small universal hitting sets with short remaining path length Lower density selection schemes via small universal hitting sets with short remaining path length ================================================================================================= Hongyu Zheng, Carl Kingsford, and Guillaume Marçais (2020) Lower density selection schemes via small universal hitting sets with short remaining path length. Proceedings of RECOMB 2020 (preprint: arXiv:2001.06550 [cs.DS]). Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes which can be used as replacement for minimizers scheme with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound. link: Preprint https://arxiv.org/abs/2001.06550 01/21/2020 199 1.0
VariantStore: A Large-Scale Genomic Variant Search Index VariantStore: A Large-Scale Genomic Variant Search Index ======================================================== Prashant Pandey, Yinjie Gao, and Carl Kingsford (2019) VariantStore: A Large-Scale Genomic Variant Search Index. *Genome Biology* **22**: 231 (2021). The ability to efficiently query genomic variation data from thousands of samples is critical to achieve the full potential of many medical and scientific applications such as personalized medicine. We present VariantStore, a system for efficiently indexing and searching millions of genomic variants across thousands of samples. We show the scalability of VariantStore by indexing genomic variants from the TCGA-BRCA project containing 8640 samples and 5M variants in ~ 4 Hrs and the 1000 genomes project containing 2500 samples and 924M variants in ~ 3 Hrs. Querying for variants in a gene takes between 2 milliseconds to 3 seconds using memory only ~ 10% of the size of the full representation. As a baseline, VariantStore outperformed VG toolkit by 3X in terms of memory-usage and construction time and uses 25% less disk space although VG toolkit does not support variant queries. link: DOI 10.1186/s13059-021-02442-8 link: Preprint https://www.biorxiv.org/content/10.1101/2019.12.24.888297v1 link: Paper https://doi.org/10.1186/s13059-021-02442-8 link: Code https://github.com/Kingsford-Group/variantstore 12/26/2019 209 1.0
Estimating mutual information under measurement error Estimating mutual information under measurement error ===================================================== Cong Ma and Carl Kingsford (2019) Estimating mutual information under measurement error. bioRxiv 852384. Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation for mutual information when the sample size is not the limiting factor for estimating PMF or PDF accurately. We compare the uncorrected and corrected estimator on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators. link: DOI 10.1101/852384 link: Preprint https://www.biorxiv.org/content/10.1101/852384v1 11/25/2019 248 1.2
ISMB Paper Selected as a Top-10 Reading Paper for 2017-2018 ISMB Paper Selected as a Top-10 Reading Paper for 2017-2018 =========================================================== November 10, 2019 Congratulations to Natalie Sauerwald, 5th-year CPCB Ph.D. student, for having one of her papers: N Sauerwald and C Kingsford. Quantifying the similarity of topological domains across normal and cancer human cell types, Bioinformatics (ISMB) 34(13):i475-i483 (2018). https://kingsfordlab.cbd.cmu.edu/2019-sauerwald-tda.html selected as one of the “Top 10 Reading Papers“ at RECOMB/ISCB Regulatory & Systems Genomics 2019. https://www.iscb.org/recomb-regsysgen2019-submissions/recomb-regsysgen2019-reading The paper developed a method to compare genome structure 11/10/2019 100 0.5
A statistical method for identifying consistently important features across samples A statistical method for identifying consistently important features across samples =================================================================================== Natalie Sauerwald and Carl Kingsford (2019) A statistical method for identifying consistently important features across samples. BioRxiv. In many applications, features with consistently high measurements across many samples are particularly meaningful and useful for quality control or biological interpretation. Identification of these features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given real-valued data, CFD requires no parameters, makes no assumptions on the underlying sample distributions, and is robust to differences across these distributions. We show that with high probability CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify lists of housekeeping genes as highly expressed genes across tissue types and compare to previous results in this domain. CFD is consistent between the two data sets, and identifies lists of genes enriched for basic cellular processes as expected. The framework can be easily adapted for many data types and desired feature properties. link: DOI 10.1101/833624 link: Preprint https://biorxiv.org/cgi/content/short/833624v1 link: Code https://github.com/Kingsford-Group/cfd 11/08/2019 262 1.3
Detecting, categorizing, and correcting coverage anomalies of RNA-seq quantification Detecting, categorizing, and correcting coverage anomalies of RNA-seq quantification ==================================================================================== Cong Ma and Carl Kingsford (2019) Detecting, categorizing, and correcting coverage anomalies of RNA-seq quantification. *Cell Systems* **9**:1-11. Due to incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA-seq sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation. We call these regions "expression anomalies". We further present a method to attribute their cause to either the incompleteness of the reference transcriptome or algorithmic mistakes. We detect anomalies for 30 GEUVADIS and 16 Human Body Map samples. By correcting anomalies when possible, we reduce the number of falsely predicted instances of differential expression. Anomalies that cannot be corrected are suspected to indicate the existence of isoforms unannotated by the reference. We detected 88 common anomalies of this type and find that they tend to have a lower-than-expected coverage towards their 3' ends. link: DOI 10.1016/j.cels.2019.10.005 link: Preprint https://doi.org/10.1101/541714 link: Paper https://doi.org/10.1016/j.cels.2019.10.005 link: Code https://github.com/Kingsford-Group/sad link: BibTeX bibtex/2019-ma-anomalies.bib 10/18/2019 228 1.1
The human body at cellular resolution: the NIH Human Biomolecular Atlas Program The human body at cellular resolution: the NIH Human Biomolecular Atlas Program =============================================================================== HuBMAP Consortium (2019) The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. *Nature* **574**:187-192 (2019). Transformative technologies are enabling the construction of three-dimensional maps of tissues with unprecedented spatial and molecular resolution. Over the next seven years, the NIH Common Fund Human Biomolecular Atlas Program (HuBMAP) intends to develop a widely accessible framework for comprehensively mapping the human body at single-cell resolution by supporting technology development, data acquisition, and detailed spatial mapping. HuBMAP will integrate its efforts with other funding agencies, programs, consortia, and the biomedical research community at large towards the shared vision of a comprehensive, accessible three-dimensional molecular and cellular atlas of the human body, in health and under various disease conditions. link: Paper https://www.nature.com/articles/s41586-019-1629-x 10/09/2019 145 0.7
Fiyinfoluwa Gbosibo and co-authors win best student paper at ACM-BCB 2019 Fiyinfoluwa Gbosibo and co-authors win best student paper at ACM-BCB 2019 ========================================================================= September 10, 2019 The paper ‘Practical Universal k-mer Sets for Minimizer Schemes’, was awarded best student paper at the ACM-BCB 2019 conference in Niagara Falls, NY. [Image Omitted] A universal k-mer set is a set of length-k strings such that every string of length > w must contain one of the k-mers in the set. This paper started as a summer internship project for Fiyinfoluwa Gbosibo to develop computational methods for finding ‘universal hitting sets’ for larger values of k. Congraduations to Fiyinfoluwa Gbosibo, Dan DeBlasio, and Guillaume Marçais for the best student paper award! Dan DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, Guillaume Marçais. Practical Universal k-mer Sets for Minimizer Schemes. In Proceedings of ACM-BCB 2019 pages 167-176. https://dl.acm.org/citation.cfm?id=3342144 09/10/2019 147 0.7
Identification of Microbiota-Induced Gene Expression Changes in the Drosophila melanogaster Head Identification of Microbiota-Induced Gene Expression Changes in the Drosophila melanogaster Head ================================================================================================ Scott A. Keith, Rory Eutsey, Heewook Lee, Brad Solomon, Stacie Oliver, Carl Kingsford, N. Luisa Hiller, and Brooke M. McCartney (2019) Identification of Microbiota-Induced Gene Expression Changes in the Drosophila melanogaster Head. bioRxiv. Symbiotic microorganisms exert multifaceted impacts on the physiology of their animal hosts. Recent discoveries have shown the gut microbiota influence host brain function and behavior, but the host and microbial molecular factors required to actuate these effects are largely unknown. To uncover molecular mechanisms that underlie the gut-microbiota-brain axis, we used Drosophila melanogaster and its bacterial microbiota as a model to identify microbiota-dependent gene expression changes in the host brain and head. Specifically, we employed RNA-seq and nanoString nCounter technology to identify Drosophila genes that exhibit altered transcript levels in fly heads upon elimination of the microbiota. The identified genes, some of which exhibited sex-specific differences, have demonstrated or inferred functional roles in the immune response, metabolism, neuronal activity, and stress resistance. Overall, this study reveals microbiota-responsive genes in the fly head, an anatomical structure not previously investigated in this context. Our results serve as a foundation for future investigations of how microbe-driven gene expression changes impact Drosophila biology. link: Preprint https://doi.org/10.1101/561043 link: BibTeX bibtex/2019-keith-fly.bib 06/21/2019 227 1.1
Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem ======================================================================================================================= Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford (2019) Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 18:1--18:19 (2019); journal version in *Alg Mol Bio* (2020). Transcriptomic structural variants (TSVs) --- structural variants that affect expressed regions --- are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangement Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2) assuming an oracle for k=1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k=2 (without an oracle). We also present an integer linear programming formulation for general k. We completely characterize the graph structures that require k>1 to satisfy all edges and show such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance from the state-of-the-art TSV-calling tool, SQUID. link: DOI https://doi.org/10.1101/697367 link: Preprint http://biorxiv.org/cgi/content/short/697367v1 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2019.18 link: BibTeX bibtex/2019-qiu-dsquid.bib 06/17/2019 291 1.5
Alignment and mapping methodology influence transcript abundance estimation Alignment and mapping methodology influence transcript abundance estimation =========================================================================== Avi Srivastava, Laraib Malik, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro (2019) Alignment and mapping methodology influence transcript abundance estimation. Genome Biology 21: 239.. The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. We investigate the effect of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential gene expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We discuss best practices regarding alignment for the purposes of quantification, and also introduce a new hybrid alignment methodology, called selective alignment (SA), to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. link: DOI 10.1101/657874 link: Preprint https://www.biorxiv.org/content/10.1101/657874v2 link: Paper https://doi.org/10.1186/s13059-020-02151-8 06/03/2019 258 1.3
Practical universal k-mer sets for minimizer schemes Practical universal k-mer sets for minimizer schemes ==================================================== Dan DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marcais (2019) Practical universal k-mer sets for minimizer schemes. To appear in ACM-BCB 2019. Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find k-mer orderings for minimizer schemes is through the use of universal k-mer sets, which are subsets of k-mers that are guaranteed to cover all windows. The smaller this set the fewer false positives (where two poorly aligned sequences being identified as possible matches) are identified. Current methods for creating universal k-mer sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal k-mer sets that can be used to construct minimizer orders for large values of k that are practical. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small k-mers. link: DOI 10.1101/652925 link: Preprint https://doi.org/10.1101/652925 05/31/2019 273 1.4
Context-Aware Seeds for Read Mapping Context-Aware Seeds for Read Mapping ==================================== Hongyi Xin, Mingfu Shao, and Carl Kingsford (2019) Context-Aware Seeds for Read Mapping. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 15:1--15:13 (2019) (journal version in Alg Mol Bio). Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mapping but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed. We prove that all valid mappings can be found if the sum of confidence radii of seeds are greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We experiment CAS with E. coli genome and show that CAS reduces seed frequencies by up to 25.4% when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. link: DOI 10.1101/643072 link: Preprint https://www.biorxiv.org/content/10.1101/643072v1 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2019.15 link: BibTeX bibtex/2019-xin-cas.bib 05/21/2019 279 1.4
Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads ==================================================================================== Laura H. Tung, Mingfu Shao, and Carl Kingsford (2019) Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads. *Genome Biology* **20**:287. Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100-4000 more (for 18 human datasets) or 1100-2200 more (for eight mouse datasets) known transcripts than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4-4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome. link: DOI 10.1101/632703 link: Preprint https://doi.org/10.1101/632703 link: Paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1883-0 05/11/2019 295 1.5
Collective Model Fusion for Multiple Black-Box Experts Collective Model Fusion for Multiple Black-Box Experts ====================================================== Quang Minh HoĂ ng, Nghia Hoang, Bryan Low, and Carl Kingsford (2019) Collective Model Fusion for Multiple Black-Box Experts. ICML PMLR 97:2742-2750, 2019. Model fusion is a fundamental problem in collective machine learning (ML) where independent experts with heterogeneous learning architectures are required to combine expertise to improve predictive performance. This is particularly challenging in information-sensitive domains (e.g., medical records in health-care analytics) where experts do not have access to each other's internal architecture and local data. To address this challenge, this paper presents the first collective model fusion framework for multiple experts with heterogeneous black-box architectures. The proposed method will enable this by addressing the following key issues of how black-box experts interact to understand the predictive behaviors of one another; how these understandings can be represented and shared efficiently among themselves; and how the shared understandings can be combined to generate high-quality consensus prediction. The performance of the resulting framework is analyzed theoretically and demonstrated empirically on several datasets. link: Paper http://proceedings.mlr.press/v97/hoang19a.html 04/22/2019 188 0.9
Sketching and Sublinear Data Structures in Genomics Sketching and Sublinear Data Structures in Genomics =================================================== Guillaume Marçais, Brad Solomon, Rob Patro, and Carl Kingsford (2019) Sketching and Sublinear Data Structures in Genomics. Annual Review of Biomedical Data Science **2**:93-118. Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each. link: Preprint https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021156 link: Paper http://www.annualreviews.org/eprint/SAEEJ5IATHJDK3ES4YST/full/10.1146/annurev-biodatasci-072018-021156 04/15/2019 142 0.7
Ph.D. student Laura Tung recieves NSF Graduate Research Fellowship Ph.D. student Laura Tung recieves NSF Graduate Research Fellowship ================================================================== April 11, 2019 Laura Tung, a Ph.D. student in the CPCB Computational Biology Program at Carnegie Mellon University has been awarded a NSF Graduate Research Fellowship. This highly competitive fellowship will partially fund her research for 3 years. Congratulations! 04/11/2019 50 0.3
Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming ===================================================================================== Vaibhav Rajan, Carl Kingsford, and Xiuwei Zhang (2019) Maximum Likelihood Reconstruction of Ancestral Networks by Integer Linear Programming. *bioRxiv*. The evolutionary history of biological networks enables deep functional and evolutionary understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPI) based on duplication and divergence. Current methods for model-based ancestral network reconstruction, primarily use greedy heuristics and yield sub-optimal solutions. We present a new Integer Linear Programming (ILP) solution for maximum likelihood reconstruction of ancestral PPI networks using the DMC model. By construction, our model is designed to find the optimal solution. It can also use efficient heuristics from general-purpose ILP solvers to obtain multiple optimal and near-optimal solutions that may be useful in many applications. Experiments on synthetic and real data show that our ILP obtains solutions with higher likelihood than those from previous methods. We evaluate our algorithm on two real PPI networks, with proteins from the families of bZIP transcription factors and Commander complex. On both the networks, solutions from our ILP has higher likelihood and are in better agreement with independent biological evidence from other studies. Availability A Python implementation is available at https://bitbucket.org/cdal/. link: DOI 10.1101/574814 link: Preprint https://www.biorxiv.org/content/10.1101/574814v1 03/13/2019 241 1.2
Analysis of the structural variability of topologically associated domains as revealed by Hi-C Analysis of the structural variability of topologically associated domains as revealed by Hi-C ============================================================================================== Natalie Sauerwald, Akshat Singhal, and Carl Kingsford (2020) Analysis of the structural variability of topologically associated domains as revealed by Hi-C. *Nuc. Acids Res. Genomics and Bioinformatics* 2(1):lqz008 (2020). Three-dimensional chromosome structure plays an integral role in gene expression and regulation, replication timing, and other cellular processes. Topologically associated domains (TADs), building blocks of chromosome structure, are genomic regions with higher contact frequencies within the region than outside the region. A central question is the degree to which TADs are conserved or vary between conditions. We analyze 137 Hi-C samples from 9 studies under 3 measures to quantify the effects of various sources of biological and experimental variation. We observe significant variation in TAD sets between both non-replicate and replicate samples, and provide initial evidence that this variability does not come from genetic sequence differences. The effects of experimental protocol differences are also measured, demonstrating that samples can have protocol-specific structural changes, but that TADs are generally robust to lab-specific differences. This study represents a systematic quantification of key factors influencing comparisons of chromosome structure, suggesting significant variability and the potential for cell-type-specific structural features, which has previously not been systematically explored. The lack of observed influence of heredity and genetic differences on chromosome structure suggests that factors other than the genetic sequence are driving this structure, which plays an important role in human disease and cellular functioning. link: DOI 10.1093/nargab/lqz008 link: Preprint https://doi.org/10.1101/498972 link: Paper https://academic.oup.com/nargab/article/doi/10.1093/nargab/lqz008/5576144?guestAccessKey=2fb52071-31d9-42f7-a5de-9b2a5b90b115 link: BibTeX bibtex/2018-sauerwald-analysis.bib 01/01/2019 293 1.5
More accurate transcript assembly via parameter advising More accurate transcript assembly via parameter advising ======================================================== Dan DeBlasio, Kwanho Kim, and Carl Kingsford (2019) More accurate transcript assembly via parameter advising. The 2019 ICML Workshop on Computational Biology; journal version in Journal of Computational Biology. Computational tools used for genomic analyses are becoming more accurate but also increasingly sophisticated and complex. This introduces a new problem in that these pieces of software have a large number of tunable parameters which often have a large influence on the results that are reported. We quantify the impact of parameter choice on transcript assembly and take some first steps towards generating a truly automated genomic analysis pipeline by developing a method for automatically choosing input-specific parameter values for reference-based transcript assembly. By choosing parameter values for each input, the area under the receiver operator characteristic curve (AUC) when comparing assembled transcripts to a reference transcriptome is increased by 28.9% over using only the default parameter choices on 1595 RNA-Seq samples in the Se- quence Read Archive. This approach is general, and when applied to StringTie it increases AUC by 13.1% on a set of 65 RNA-Seq experiments from ENCODE. Parameter advisors for both Scallop and StringTie are available on Github link: Preprint https://doi.org/10.1101/342865 link: Paper https://www.liebertpub.com/doi/full/10.1089/cmb.2019.0286 link: Code https://github.com/Kingsford-Group/scallopadvising link: BibTeX bibtex/2019-deblasio-scalloppa.bib 01/01/2019 240 1.2
Locality sensitive hashing for the edit distance Locality sensitive hashing for the edit distance ================================================ Guillaume Marçais, Dan DeBlasio, Prashant Pandey, and Carl Kingsford (2019) Locality sensitive hashing for the edit distance. *Bioinformatics* 35(14):i127-i135 (*ISMB*) 2019. Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. link: Preprint https://doi.org/10.1101/534446 link: Paper https://academic.oup.com/bioinformatics/article/35/14/i127/5529166?guestAccessKey=cd30a36c-7f1a-420a-b787-1b7f28da9a2a link: Code http://github.com/Kingsford-Group/omhismb2019 link: BibTeX bibtex/2019-marcais-editlsh.bib 01/01/2019 286 1.4
Topological data analysis reveals principles of chromosome structure throughout cellular differentiation Topological data analysis reveals principles of chromosome structure throughout cellular differentiation ======================================================================================================== Natalie Sauerwald, Yihang Shen, and Carl Kingsford (2019) Topological data analysis reveals principles of chromosome structure throughout cellular differentiation. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 23:1--23:16 (2019). Topological data analysis (TDA) is a mathematically well-founded set of methods to derive robust information about the structure and topology of data. It has been applied successfully in several biological contexts. Derived primarily from algebraic topology, TDA rigorously identifies persistent features in complex data, making it well-suited to better understand the key features of three-dimensional chromosome structure. Chromosome structure has a significant influence in many diverse genomic processes and has recently been shown to relate to cellular differentiation. While there exist many methods to study specific substructures of chromosomes, we are still missing a global view of all geometric features of chromosomes. By applying TDA to the study of chromosome structure through differentiation across three cell lines, we provide insight into principles of chromosome folding and looping. We identify persistent connected components and one-dimensional topological features of chromosomes and characterize them across cell types and stages of differentiation. Availability: Scripts to reproduce the results from this study can be found at https://github.com/Kingsford-Group/hictda link: Preprint https://doi.org/10.1101/540716 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2019.23 link: Code https://github.com/Kingsford-Group/hictda link: BibTeX bibtex/2019-sauerwald-tda.bib 01/01/2019 249 1.2
Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery ================================================================================= Heewook Lee and Carl Kingsford (2018) Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery. *Genome Biology* **19**:16. link: Preprint https://doi.org/10.1101/138826 link: Paper http://rdcu.be/GtQH link: Code https://github.com/Kingsford-Group/kourami link: BibTeX bibtex/2018-lee-kourami.bib 01/01/2018 61 0.3
Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami ========================================================================== Heewook Lee and Carl Kingsford (2018) Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami. *HLA Typing*, pages 235-247. link: Paper https://link.springer.com/protocol/10.1007%2F978-1-4939-8546-3_17 link: BibTeX bibtex/2018-lee-kouramiuser.bib 01/01/2018 56 0.3
SQUID: transcriptomic structural variation detection from RNA-seq SQUID: transcriptomic structural variation detection from RNA-seq ================================================================= Cong Ma, Mingfu Shao, and Carl Kingsford (2018) SQUID: transcriptomic structural variation detection from RNA-seq. *Genome Biology* **19**:52. link: Preprint https://doi.org/10.1101/162776 link: Paper https://rdcu.be/Lroj link: Code https://github.com/Kingsford-Group/squid link: BibTeX bibtex/2018-ma-squid.bib 01/01/2018 57 0.3
Asymptotically optimal minimizers schemes Asymptotically optimal minimizers schemes ========================================= Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2018) Asymptotically optimal minimizers schemes. *Bioinformatics* (ISMB) **34**(13):i13-i22. link: Preprint https://doi.org/10.1101/256156 link: Paper https://academic.oup.com/bioinformatics/article/34/13/i13/5045769 link: BibTeX bibtex/2018-marcais-optminimizers.bib 01/01/2018 50 0.3
Quantifying the similarity of topological domains across normal and cancer human cell types Quantifying the similarity of topological domains across normal and cancer human cell types =========================================================================================== Natalie Sauerwald and Carl Kingsford (2018) Quantifying the similarity of topological domains across normal and cancer human cell types. *Bioinformatics* (ISMB) **34**(13):i475-i483. link: Paper https://doi.org/10.1093/bioinformatics/bty265 link: Code https://github.com/Kingsford-Group/localtadsim link: BibTeX bibtex/2018-sauerwald-tadsim.bib 01/01/2018 62 0.3
Improved search of large transcriptomic sequencing databases using split sequence bloom trees Improved search of large transcriptomic sequencing databases using split sequence bloom trees ============================================================================================= Brad Solomon and Carl Kingsford (2018) Improved search of large transcriptomic sequencing databases using split sequence bloom trees. *Journal of Computational Biology* **25**(7):755-765. link: Paper https://www.liebertpub.com/doi/abs/10.1089/cmb.2017.0265?rfr_dat=cr_pub%3Dpubmed&url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&journalCode=cmb link: Code https://github.com/Kingsford-Group/splitsbt link: BibTeX bibtex/2018-solomon-ssbt.bib 01/01/2018 80 0.4
Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data =================================================================================== Hao Wang, Carl Kingsford, and C Joel McManus (2018) Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data. *Methods* **137**:67-70. link: Paper https://doi.org/10.1016/j.ymeth.2018.01.002 link: BibTeX bibtex/2018-wang-ribopipe.bib 01/01/2018 58 0.3
Dan DeBlasio published book Dan DeBlasio published book =========================== November 10, 2017 Dan DeBlasio has published a book with his Ph.D. advisor covering his research on learning the best way to run algorithms for multiple sequence alignment. http://www.springer.com/us/book/9783319649177 [Image Omitted] Dan is a member of the Kingsford Group and a Lane Fellow in the Computational Biology Department at Carnegie Mellon. 11/10/2017 62 0.3
Response to blog post about Salmon Response to blog post about Salmon ================================== August 16, 2017 Response to Blog Post About Salmon. http://bit.ly/SalmonBlogResp 08/16/2017 19 0.1
100 Things to do in Pittsburgh 100 Things to do in Pittsburgh ============================== August 10, 2017 A Ph.D. might span ~240 weekends. If we assume that 1 weekend / month is spent relaxing, watching movies and TV and running errands, 1 weekend / month is spent working 😁, and 4 weekends per year are spent traveling, that leaves about 100 weekends to explore Pittsburgh, if you are a student here. Here’s a list of 100 things to do during those weekends. http://www.monogrammedchalk.com/things-to-do-in-pittsburgh/ 08/10/2017 82 0.4
Mingfu Shao interviewed about Scallop on new podcast Mingfu Shao interviewed about Scallop on new podcast ==================================================== April 16, 2017 Lane Fellow Mingfu Shao was interviewed on the inaugural episode of the The Bioinformatics Chat podcast! He talks about his new transcriptome assembler, Scallop. https://www.cs.cmu.edu/~mingfus/ https://bioinformatics.chat/scallop https://github.com/Kingsford-Group/scallop Listen to hear all about Mingfu’s work! You can read more about Scallop here. https://www.nature.com/articles/nbt.4020 04/16/2017 73 0.4
Improving the performance of minimizers and winnowing schemes Improving the performance of minimizers and winnowing schemes ============================================================= Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford (2017) Improving the performance of minimizers and winnowing schemes. *Bioinformatics* (ISMB) **33**(14):i110-117. link: Preprint https://doi.org/10.1101/104075 link: Paper https://academic.oup.com/bioinformatics/article/33/14/i110/3953951/Improving-the-performance-of-minimizers-and?guestAccessKey=594c6ca5-e053-4713-b23f-28a1481be865 link: BibTeX bibtex/2017-marcais-betterminimizers.bib 01/01/2017 76 0.4
Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing ================================================================================================ Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford (2017) Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. *PLoS Computational Biology* **13**(10):e1005777. link: Paper http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005777 link: BibTeX bibtex/2017-orenstein-docks.bib 01/01/2017 68 0.3
Salmon provides fast and bias-aware quantification of transcript expression Salmon provides fast and bias-aware quantification of transcript expression =========================================================================== Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford (2017) Salmon provides fast and bias-aware quantification of transcript expression. *Nature Methods* **14**:417-419. link: Preprint https://doi.org/10.1101/021592 link: Paper http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4197.html link: Code https://github.com/COMBINE-lab/salmon link: BibTeX bibtex/2017-patro-salmon.bib 01/01/2017 76 0.4
Improving Bloom filter performance on sequence data using k-mer Bloom filters Improving Bloom filter performance on sequence data using k-mer Bloom filters ============================================================================= David Pellow, Darya Filippova, and Carl Kingsford (2017) Improving Bloom filter performance on sequence data using k-mer Bloom filters. *Journal of Computational Biology* **24**(6):547-557. link: Paper https://www.liebertpub.com/doi/full/10.1089/cmb.2016.0155?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dpubmed link: Code https://github.com/Kingsford-Group/kbf link: BibTeX bibtex/2017-pellow-seqbloomfilter.bib 01/01/2017 80 0.4
Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings ====================================================================================================================== Natalie Sauerwald, She Zhang, Carl Kingsford, and Ivet Bahar (2017) Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings. *Nucleic Acids Research* **45**(7):3663-3673. link: Preprint https://doi.org/10.1101/081182 link: Paper https://doi.org/10.1093/nar/gkx172 link: Code https://github.com/Kingsford-Group/chromatingnm link: BibTeX bibtex/2017-sauerwald-gnm.bib 01/01/2017 81 0.4
Theory and A Heuristic for the Minimum Path Flow Decomposition Problem Theory and A Heuristic for the Minimum Path Flow Decomposition Problem ====================================================================== Mingfu Shao and Carl Kingsford (2017) Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. *IEEE/ACM Transactions on Computational Biology and Bioinformatics* **16**(2):658-670. link: Preprint https://doi.org/10.1101/087759 link: Paper https://ieeexplore.ieee.org/document/8126870 link: Code https://github.com/Kingsford-Group/catfish link: BibTeX bibtex/2017-shao-catfish.bib 01/01/2017 71 0.4
Accurate assembly of transcripts through phase-preserving graph decomposition Accurate assembly of transcripts through phase-preserving graph decomposition ============================================================================= Mingfu Shao and Carl Kingsford (2017) Accurate assembly of transcripts through phase-preserving graph decomposition. *Nature Biotechnology* **35**:1167-1169. link: Preprint https://doi.org/10.1101/123612 link: Paper https://www.nature.com/articles/nbt.4020 link: Code https://github.com/Kingsford-Group/scallop link: BibTeX bibtex/2017-shao-scallop.bib 01/01/2017 61 0.3
Improved search of large transcriptomic sequencing databases using split sequence bloom trees Improved search of large transcriptomic sequencing databases using split sequence bloom trees ============================================================================================= Brad Solomon and Carl Kingsford (2017) Improved search of large transcriptomic sequencing databases using split sequence bloom trees. *International Conference on Research in Computational Molecular Biology*, pages 257-271. link: Preprint https://doi.org/10.1101/086561 link: Paper https://link.springer.com/chapter/10.1007/978-3-319-56970-3_16 link: Code https://github.com/Kingsford-Group/splitsbt link: BibTeX bibtex/2017-solomon-ssbt-recomb.bib 01/01/2017 79 0.4
Accurate recovery of ribosome positions reveals slow translation of wobble-pairing codons in yeast Accurate recovery of ribosome positions reveals slow translation of wobble-pairing codons in yeast ================================================================================================== Hao Wang, Joel McManus, and Carl Kingsford (2017) Accurate recovery of ribosome positions reveals slow translation of wobble-pairing codons in yeast. *Journal of Computational Biology* **24**(6):486-500. link: Paper https://www.liebertpub.com/doi/full/10.1089/cmb.2016.0147?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dpubmed link: Code https://github.com/Kingsford-Group/ribodeblur link: BibTeX bibtex/2017-wang-ribodeblur.bib 01/01/2017 84 0.4
Efficient index maintenance under dynamic genome modification Efficient index maintenance under dynamic genome modification ============================================================= Nitish Gupta, Komal Sanjeev, Tim Wall, Carl Kingsford, and Rob Patro (2016) Efficient index maintenance under dynamic genome modification. *ArXiv Preprint ArXiv:1604.03132*. link: Preprint http://arxiv.org/abs/1604.03132 link: BibTeX bibtex/2016-gupta-dynamicindex.bib 01/01/2016 46 0.2
The second decade of the international conference on research in computational molecular biology (RECOMB) The second decade of the international conference on research in computational molecular biology (RECOMB) ========================================================================================================= Farhad Hormozdiari, Fereydoun Hormozdiari, Carl Kingsford, Paul Medvedev, and Fabio Vandin (2016) The second decade of the international conference on research in computational molecular biology (RECOMB). *International Conference on Research in Computational Molecular Biology*, pages 3-16. link: Paper https://link.springer.com/chapter/10.1007/978-3-319-31957-5_1 link: BibTeX bibtex/2016-hormozdiari-recombreview.bib 01/01/2016 72 0.4
A pathway-centric view of spatial proximity in the 3D nucleome across cell lines A pathway-centric view of spatial proximity in the 3D nucleome across cell lines ================================================================================ Hiren Karathia, Carl Kingsford, Michelle Girvan, and Sridhar Hannenhalli (2016) A pathway-centric view of spatial proximity in the 3D nucleome across cell lines. *Scientific Reports* **6**:39279. link: Preprint https://doi.org/10.1101/027045 link: Paper http://www.nature.com/articles/srep39279 link: BibTeX bibtex/2016-karathia-hicpathway.bib 01/01/2016 65 0.3
Teaching Computation to Biologists (book review) Teaching Computation to Biologists (book review) ================================================ Carl Kingsford (2016) Teaching Computation to Biologists (book review). *Computing in Science & Engineering* **18**(1):4-5. link: Paper https://ieeexplore.ieee.org/document/7361654 link: BibTeX bibtex/2016-kingsford-teachingreview.bib 01/01/2016 38 0.2
Compact universal k-mer hitting sets Compact universal k-mer hitting sets ==================================== Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford (2016) Compact universal k-mer hitting sets. *International Workshop on Algorithms in Bioinformatics*, pages 257-268. link: Paper http://link.springer.com/chapter/10.1007/978-3-319-43681-4_21 link: BibTeX bibtex/2016-orenstein-compactkmers.bib 01/01/2016 55 0.3
A computational method for designing diverse linear epitopes including citrullinated peptides with desired binding affinities to intravenous immunoglobulin A computational method for designing diverse linear epitopes including citrullinated peptides with desired binding affinities to intravenous immunoglobulin =========================================================================================================================================================== Rob Patro, Raquel Norel, Robert J Prill, Julio Saez-Rodriguez, Peter Lorenz, Felix Steinbeck, Bjoern Ziems, Mitja LuĹĄtrek, Nicola Barbarini, Alessandra Tiengo, and others (2016) A computational method for designing diverse linear epitopes including citrullinated peptides with desired binding affinities to intravenous immunoglobulin. *BMC Bioinformatics* **17**:155. link: Paper http://www.biomedcentral.com/1471-2105/17/155 link: BibTeX bibtex/2016-patro-dream-5.bib 01/01/2016 84 0.4
Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations ================================================================================================================ Emre Sefer, Geet Duggal, and Carl Kingsford (2016) Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations. *Journal of Computational Biology* **23**(6):425-438. link: Paper https://www.liebertpub.com/doi/abs/10.1089/cmb.2015.0210?rfr_dat=cr_pub%3Dpubmed&url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&journalCode=cmb link: BibTeX bibtex/2016-sefer-deconvolution.bib 01/01/2016 78 0.4
Diffusion archeology for diffusion progression history reconstruction Diffusion archeology for diffusion progression history reconstruction ===================================================================== Emre Sefer and Carl Kingsford (2016) Diffusion archeology for diffusion progression history reconstruction. *Knowledge and Information Systems*, pages 1-25. link: Preprint http://www.cs.cmu.edu/~ckingsf/software/dhrec/icdm2014.pdf link: Paper https://dl.acm.org/citation.cfm?id=3008262 link: Code http://www.cs.cmu.edu/~ckingsf/software/dhrec/ link: BibTeX bibtex/2016-sefer-diffusion.bib 01/01/2016 66 0.3
Fast search of thousands of short-read sequencing experiments Fast search of thousands of short-read sequencing experiments ============================================================= Brad Solomon and Carl Kingsford (2016) Fast search of thousands of short-read sequencing experiments. *Nature Biotechnology* **34**:300-302. link: Preprint https://doi.org/10.1101/017087 link: Paper http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3442.html link: Code https://github.com/Kingsford-Group/bloomtree link: BibTeX bibtex/2016-solomon-sbt.bib 01/01/2016 66 0.3
Exploring ribosome positioning on translating transcripts with ribosome profiling Exploring ribosome positioning on translating transcripts with ribosome profiling ================================================================================= Pieter Spealman, Hao Wang, Gemma May, Carl Kingsford, and C Joel McManus (2016) Exploring ribosome positioning on translating transcripts with ribosome profiling. *Post-Transcriptional Gene Regulation*, pages 71-97. link: Paper https://link.springer.com/protocol/10.1007%2F978-1-4939-3067-8_5 link: BibTeX bibtex/2016-spealman-ribomethods.bib 01/01/2016 59 0.3
Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap ======================================================================================= Hao Wang, Joel McManus, and Carl Kingsford (2016) Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap. *Bioinformatics* **32**(12):1880-1882 . link: Preprint https://doi.org/10.1101/017509 link: Paper http://bioinformatics.oxfordjournals.org/cgi/reprint/btw085?%20ijkey=JJzbUzVOiqPTvjH&keytype=ref link: Code https://github.com/Kingsford-Group/ribomap link: BibTeX bibtex/2016-wang-ribomap.bib 01/01/2016 71 0.4
Rapid, separable compression enables fast analyses of sequence alignments Rapid, separable compression enables fast analyses of sequence alignments ========================================================================= Darya Filippova and Carl Kingsford (2015) Rapid, separable compression enables fast analyses of sequence alignments. *Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics*, pages 194-201. link: Paper https://dl.acm.org/citation.cfm?id=2808739 link: Code https://github.com/Kingsford-Group/referee link: BibTeX bibtex/2015-filippova-referee.bib 01/01/2015 65 0.3
Reference-based compression of short-read sequences using path encoding Reference-based compression of short-read sequences using path encoding ======================================================================= Carl Kingsford and Rob Patro (2015) Reference-based compression of short-read sequences using path encoding. *Bioinformatics* **31**(12):1920-1928 . link: Preprint https://doi.org/10.1101/006551 link: Paper http://dx.doi.org/10.1093/bioinformatics/btv071 link: Code https://github.com/Kingsford-Group/kpath link: BibTeX bibtex/2015-kingsford-pathenc.bib 01/01/2015 64 0.3
Data-dependent bucketing improves reference-free compression of sequencing reads Data-dependent bucketing improves reference-free compression of sequencing reads ================================================================================ Rob Patro and Carl Kingsford (2015) Data-dependent bucketing improves reference-free compression of sequencing reads. *Bioinformatics* **31**(17):2770-2777 . link: Paper http://bioinformatics.oxfordjournals.org/content/early/2015/05/25/bioinformatics.btv248 link: Code https://github.com/Kingsford-Group/mince link: BibTeX bibtex/2015-patro-mince.bib 01/01/2015 59 0.3
Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales =============================================================================================== Emre Sefer and Carl Kingsford (2015) Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales. *2015 IEEE 31st International Conference on Data Engineering*, pages 663-674. link: Preprint http://www.cs.cmu.edu/~esefer/papers/icde2015.pdf link: Paper https://www.computer.org/csdl/proceedings/icde/2015/7964/00/07113323.pdf link: Code http://www.cs.cmu.edu/~ckingsf/software/cormin/ link: BibTeX bibtex/2015-sefer-convex.bib 01/01/2015 85 0.4
Semi-nonparametric modeling of topological domain formation from epigenetic data Semi-nonparametric modeling of topological domain formation from epigenetic data ================================================================================ Emre Sefer and Carl Kingsford (2015) Semi-nonparametric modeling of topological domain formation from epigenetic data. *International Workshop on Algorithms in Bioinformatics*, pages 148-161. link: Paper https://link.springer.com/chapter/10.1007/978-3-662-48221-6_11 link: BibTeX bibtex/2015-sefer-semi.bib 01/01/2015 56 0.3
Optimal seed solver: optimizing seed selection in read mapping Optimal seed solver: optimizing seed selection in read mapping ============================================================== Hongyi Xin, Sunny Nahar, Richard Zhu, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu (2015) Optimal seed solver: optimizing seed selection in read mapping. *Bioinformatics* **32**(11):1632-1642. link: Preprint https://arxiv.org/abs/1506.08235 link: Paper https://doi.org/10.1093/bioinformatics/btv670 link: Code https://github.com/CMU-SAFARI/optimal-seed-solver link: BibTeX bibtex/2015-xin-oss.bib 01/01/2015 75 0.4
Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping ======================================================================================================================= Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu (2015) Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. *Bioinformatics* **31**(10):1553-1560. link: Paper http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btu856?%20ijkey=iQ4UOCzdu7rxIAr&keytype=ref link: Code https://github.com/CMU-SAFARI/Shifted-Hamming-Distance link: BibTeX bibtex/2015-xin-shifted.bib 01/01/2015 86 0.4
Identification of alternative topological domains in chromatin Identification of alternative topological domains in chromatin ============================================================== Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2014) Identification of alternative topological domains in chromatin. *Algorithms for Molecular Biology*, 9**14**. link: Paper https://almob.biomedcentral.com/articles/10.1186/1748-7188-9-14 link: Code https://github.com/kingsfordgroup/armatus link: BibTeX bibtex/2014-filippova-armatus-journal.bib 01/01/2014 58 0.3
Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms ====================================================================================================== Rob Patro, Stephen M Mount, and Carl Kingsford (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. *Nature Biotechnology* **32**:462-464. link: Paper http://www.nature.com/nbt/journal/vaop/ncurrent/abs/nbt.2862.html link: Code https://github.com/kingsfordgroup/sailfish link: BibTeX bibtex/2014-patro-sailfish.bib 01/01/2014 68 0.3
Resolving spatial inconsistencies in chromosome conformation measurements Resolving spatial inconsistencies in chromosome conformation measurements ========================================================================= Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, and Carl Kingsford (2013) Resolving spatial inconsistencies in chromosome conformation measurements. *Algorithms for Molecular Biology* **8**:8. link: Paper https://almob.biomedcentral.com/articles/10.1186/1748-7188-8-8 link: BibTeX bibtex/2013-duggal-mms-amb.bib 01/01/2013 57 0.3
Higher-order chromatin domains link eQTLs with the expression of far-away genes Higher-order chromatin domains link eQTLs with the expression of far-away genes =============================================================================== Geet Duggal, Hao Wang, and Carl Kingsford (2013) Higher-order chromatin domains link eQTLs with the expression of far-away genes. *Nucleic Acids Research* **42**(1):87-96 . link: Paper http://nar.oxfordjournals.org/cgi/reprint/gkt857?%20ijkey=khPNpMjJ3oxMG90&keytype=ref link: BibTeX bibtex/2013-duggal-tadeqtl.bib 01/01/2013 61 0.3
Multiscale identification of topological domains in chromatin Multiscale identification of topological domains in chromatin ============================================================= Darya Filippova, Rob Patro, Geet Duggal, and Carl Kingsford (2013) Multiscale identification of topological domains in chromatin. *International Workshop on Algorithms in Bioinformatics*, pages 300-312. link: Paper http://www.almob.org/content/9/1/14 link: BibTeX bibtex/2013-filippova-armatus.bib 01/01/2013 50 0.3
Predicting protein interactions via parsimonious network history inference Predicting protein interactions via parsimonious network history inference ========================================================================== Rob Patro and Carl Kingsford (2013) Predicting protein interactions via parsimonious network history inference. *Bioinformatics* **29**(13):i237-i246. link: Paper https://doi.org/10.1093/bioinformatics/btt224 link: Code https://github.com/kingsfordgroup/parana2 link: BibTeX bibtex/2013-patro-parana2.bib 01/01/2013 50 0.3
Topological properties of chromosome conformation graphs reflect spatial proximities within chromatin Topological properties of chromosome conformation graphs reflect spatial proximities within chromatin ===================================================================================================== Hao Wang, Geet Duggal, Rob Patro, Michelle Girvan, Sridhar Hannenhalli, and Carl Kingsford (2013) Topological properties of chromosome conformation graphs reflect spatial proximities within chromatin. *Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics*, pages 306-315. link: Paper https://dl.acm.org/citation.cfm?id=2506633 link: BibTeX bibtex/2013-wang-hicspatial.bib 01/01/2013 68 0.3
Resolving spatial inconsistencies in chromosome conformation data Resolving spatial inconsistencies in chromosome conformation data ================================================================= Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, and Carl Kingsford (2012) Resolving spatial inconsistencies in chromosome conformation data. *International Workshop on Algorithms in Bioinformatics*, pages 288-300. link: Paper http://www.springerlink.com/content/f02858k117587111/ link: BibTeX bibtex/2012-duggal-mms-wabi.bib 01/01/2012 55 0.3
Graph rigidity reveals well-constrained regions of chromosome conformation embeddings Graph rigidity reveals well-constrained regions of chromosome conformation embeddings ===================================================================================== Geet Duggal and Carl Kingsford (2012) Graph rigidity reveals well-constrained regions of chromosome conformation embeddings. *BMC Bioinformatics* **13**:241. link: Paper http://www.biomedcentral.com/1471-2105/13/241 link: BibTeX bibtex/2012-duggal-rigidity.bib 01/01/2012 47 0.2
Coral: an integrated suite of visualizations for comparing clusterings Coral: an integrated suite of visualizations for comparing clusterings ====================================================================== Darya Filippova, Aashish Gadani, and Carl Kingsford (2012) Coral: an integrated suite of visualizations for comparing clusterings. *BMC Bioinformatics* **13**:276. link: Paper http://www.biomedcentral.com/1471-2105/13/276/ link: BibTeX bibtex/2012-filippova-coral.bib 01/01/2012 47 0.2
Dynamic exploration of recording sessions between jazz musicians over time Dynamic exploration of recording sessions between jazz musicians over time ========================================================================== Darya Filippova, Michael Fitzgerald, Carl Kingsford, and Fernando Benadon (2012) Dynamic exploration of recording sessions between jazz musicians over time. *ASE/IEEE International Conference on Social Computing*, pages 368-376. link: Paper https://ieeexplore.ieee.org/document/6406377 link: BibTeX bibtex/2012-filippova-jazzmap.bib 01/01/2012 55 0.3
Global network alignment using multiscale spectral signatures Global network alignment using multiscale spectral signatures ============================================================= Rob Patro and Carl Kingsford (2012) Global network alignment using multiscale spectral signatures. *Bioinformatics* **28**(23):3105-3114. link: Paper http://bioinformatics.oxfordjournals.org/content/early/2012/10/08/bioinformatics.bts592.abstract link: Code https://github.com/Kingsford-Group/ghost link: BibTeX bibtex/2012-patro-ghost.bib 01/01/2012 54 0.3
The missing models: A data-driven approach for learning how networks grow The missing models: A data-driven approach for learning how networks grow ========================================================================= Robert Patro, Geet Duggal, Emre Sefer, Hao Wang, Darya Filippova, and Carl Kingsford (2012) The missing models: A data-driven approach for learning how networks grow. *Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining*, pages 42-50. link: Paper http://delivery.acm.org/10.1145/2340000/2339541/p42-patro.pdf link: BibTeX bibtex/2012-patro-growcode.bib 01/01/2012 75 0.4
Parsimonious reconstruction of network evolution Parsimonious reconstruction of network evolution ================================================ Rob Patro, Emre Sefer, Justin Malin, Guillaume Marçais, Saket Navlakha, and Carl Kingsford (2012) Parsimonious reconstruction of network evolution. *Algorithms for Molecular Biology* **7**:25. link: Paper http://www.almob.org/content/7/1/25 link: BibTeX bibtex/2012-patro-parana.bib 01/01/2012 48 0.2
Finding ranges of optimal transcript expression quantification in cases of non-identifiability Finding ranges of optimal transcript expression quantification in cases of non-identifiability ============================================================================================== Cong Ma, Hongyu Zheng, and Carl Kingsford (2019) Finding ranges of optimal transcript expression quantification in cases of non-identifiability. RECOMB 2021. Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Probabilistic models are at the core of current quantification methods. The scenario where a probabilistic model has multiple optimal solutions is known as non-identifiability. In expression quantification, multiple configurations of transcript expression may be equally likely to generate the sequencing reads and the underlying true expression cannot be uniquely determined. It is still unknown from existing methods what the set of multiple solutions is or how far the equally optimal solutions are from each other. Such information is necessary for evaluating the reliability of analyses that are based on a single inferred expression vector and for extending analyses to take all optimal solutions into account. We propose methods to compute the range of optimal estimates for the expression of each transcript when the probabilistic model for expression inference is non-identifiable. The accuracy and identifiability of expression estimates depend on the completeness of input reference transcriptome, therefore our method also takes an assumed percentage of expression from combinations of known junctions into consideration. Applying our method on 16 Human Body Map samples and comparing with the single expression vector quantified by Salmon, we observe that the ranges of optimal abundances are on the same scale as Salmon's estimate. Analyzing the overlap of ranges of optima in differential expression (DE) detection reveals that the majority of predictions are reliable, but there are a few unreliable predictions for which switching to other optimal abundances may lead to similar expression between DE conditions. The source code can be found at https://github.com/Kingsford-Group/subgraphquant. link: DOI 10.1101/2019.12.13.875625 link: Preprint https://www.biorxiv.org/content/10.1101/2019.12.13.875625v1 link: Code https://github.com/Kingsford-Group/subgraphquant 12/15/2011 342 1.7
Extracting between-pathway models from E-MAP interactions using expected graph compression Extracting between-pathway models from E-MAP interactions using expected graph compression ========================================================================================== David R Kelley and Carl Kingsford (2011) Extracting between-pathway models from E-MAP interactions using expected graph compression. *Journal of Computational Biology* **18**(3):379-390. link: Paper http://www.liebertonline.com/doi/abs/10.1089/cmb.2010.0268 link: BibTeX bibtex/2011-kelley-emap.bib 01/01/2011 59 0.3
A cost-aggregating integer linear program for motif finding A cost-aggregating integer linear program for motif finding =========================================================== Carl Kingsford, Elena Zaslavsky, and Mona Singh (2011) A cost-aggregating integer linear program for motif finding. *Journal of Discrete Algorithms* **9**(4):326-334. link: Paper http://dx.doi.org/10.1016/j.jda.2011.04.001 link: BibTeX bibtex/2011-kingsford-motifilp.bib 01/01/2011 54 0.3
A fast, lock-free approach for efficient parallel counting of occurrences of k-mers A fast, lock-free approach for efficient parallel counting of occurrences of k-mers =================================================================================== Guillaume Marçais and Carl Kingsford (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. *Bioinformatics* **27**(6):764-770. link: Paper http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btr011?ijkey=TCpSPz805ZwvjzY&keytype=ref link: Code https://github.com/gmarcais/Jellyfish link: BibTeX bibtex/2011-marcais-jellyfish.bib 01/01/2011 68 0.3
Network archaeology: uncovering ancient networks from present-day interactions Network archaeology: uncovering ancient networks from present-day interactions ============================================================================== Saket Navlakha and Carl Kingsford (2011) Network archaeology: uncovering ancient networks from present-day interactions. *PLoS Computational Biology* **7**(4):e1001119. link: Paper http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1001119 link: BibTeX bibtex/2011-navlakha-netarch.bib 01/01/2011 51 0.3
Metric labeling and semi-metric embedding for protein annotation prediction Metric labeling and semi-metric embedding for protein annotation prediction =========================================================================== Emre Sefer and Carl Kingsford (2011) Metric labeling and semi-metric embedding for protein annotation prediction. *International Conference on Research in Computational Molecular Biology*, pages 392-407. link: Paper http://www.springerlink.com/content/k3p63424kp6822rp/ link: BibTeX bibtex/2011-sefer-metric.bib 01/01/2011 52 0.3
Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies ========================================================================================================== Joshua Wetzel, Carl Kingsford, and Mihai Pop (2011) Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. *BMC Bioinformatics* **12**:95. link: Paper http://www.biomedcentral.com/1471-2105/12/95/abstract link: BibTeX bibtex/2011-wetzel-matepairs.bib 01/01/2011 64 0.3
Uncovering many views of biological networks using ensembles of near-optimal partitions Uncovering many views of biological networks using ensembles of near-optimal partitions ======================================================================================= Geet Duggal, Saket Navlakha, Michelle Girvan, and Carl Kingsford (2010) Uncovering many views of biological networks using ensembles of near-optimal partitions. *MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings at KDD*. link: Paper http://eecs.oregonstate.edu/research/multiclust/nearopttrees-2.pdf link: BibTeX bibtex/2010-duggal-multiclust.bib 01/01/2010 65 0.3
Extracting between-pathway models from E-MAP interactions using expected graph compression Extracting between-pathway models from E-MAP interactions using expected graph compression ========================================================================================== David R Kelley and Carl Kingsford (2010) Extracting between-pathway models from E-MAP interactions using expected graph compression. *Annual International Conference on Research in Computational Molecular Biology*, pages 248-262. link: Paper http://dx.doi.org/10.1007/978-3-642-12683-3_16 link: BibTeX bibtex/2010-kelley-emap-recomb.bib 01/01/2010 64 0.3
Assembly complexity of prokaryotic genomes using short reads Assembly complexity of prokaryotic genomes using short reads ============================================================ Carl Kingsford, Michael C Schatz, and Mihai Pop (2010) Assembly complexity of prokaryotic genomes using short reads. *BMC Bioinformatics* **11**:21. link: Paper http://www.biomedcentral.com/1471-2105/11/21/abstract link: BibTeX bibtex/2010-kingsford-assemblycomplexity.bib 01/01/2010 47 0.2
Particle swarm optimization for multimodal combinatorial problems and its application to protein design Particle swarm optimization for multimodal combinatorial problems and its application to protein design ======================================================================================================= Grecia Lapizco-Encinas, Carl Kingsford, and James Reggia (2010) Particle swarm optimization for multimodal combinatorial problems and its application to protein design. *IEEE Congress on Evolutionary Computation*, pages 1-8. link: Paper https://ieeexplore.ieee.org/document/5586157 link: BibTeX bibtex/2010-lapizco-swarm.bib 01/01/2010 58 0.3
GiRaF: robust, computational identification of influenza reassortments via graph mining GiRaF: robust, computational identification of influenza reassortments via graph mining ======================================================================================= Niranjan Nagarajan and Carl Kingsford (2010) GiRaF: robust, computational identification of influenza reassortments via graph mining. *Nucleic Acids Research* **39**(6):e34 . link: Paper http://nar.oxfordjournals.org/content/39/6/e34 link: Code https://github.com/Kingsford-Group/GIRAF link: BibTeX bibtex/2010-nagarajan-giraf.bib 01/01/2010 57 0.3
The power of protein interaction networks for associating genes with diseases The power of protein interaction networks for associating genes with diseases ============================================================================= Saket Navlakha and Carl Kingsford (2010) The power of protein interaction networks for associating genes with diseases. *Bioinformatics* **26**(8):1057-1063. link: Paper http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq076?ijkey=r07K7boQNKuYsS4&keytype=ref link: BibTeX bibtex/2010-navlakha-diseasenet.bib 01/01/2010 54 0.3
Exploring biological network dynamics with ensembles of graph partitions Exploring biological network dynamics with ensembles of graph partitions ======================================================================== Saket Navlakha and Carl Kingsford (2010) Exploring biological network dynamics with ensembles of graph partitions. *Pacific Symposium on Biocomputing (PSB)*, pages 166-177. link: Paper http://dx.doi.org/10.1142/9789814295291_0019 link: BibTeX bibtex/2010-navlakha-ensembles.bib 01/01/2010 48 0.2
Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information ================================================================================================================ Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, and Carl Kingsford (2010) Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. *Journal of Computational Biology* **17**(3):503--516. link: Paper http://www.liebertonline.com/doi/abs/10.1089/cmb.2009.0173 link: BibTeX bibtex/2010-navlakha-vicut-jcb.bib 01/01/2010 67 0.3
Alignment and clustering of phylogenetic markers-implications for microbial diversity studies Alignment and clustering of phylogenetic markers-implications for microbial diversity studies ============================================================================================= James R White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop (2010) Alignment and clustering of phylogenetic markers-implications for microbial diversity studies. *BMC Bioinformatics* **11**:152. link: Paper http://www.biomedcentral.com/1471-2105/11/152/abstract link: BibTeX bibtex/2010-white-metadiversity.bib 01/01/2010 60 0.3
Vertices of degree k in edge-minimal, k-edge-connected graphs Vertices of degree k in edge-minimal, k-edge-connected graphs ============================================================= Carl Kingsford and Guillaume Marçais (2009) Vertices of degree k in edge-minimal, k-edge-connected graphs. *ArXiv Preprint ArXiv:0905.1064*. link: Preprint http://arxiv.org/abs/0905.1064 link: BibTeX bibtex/2009-kingsford-degreek.bib 01/01/2009 49 0.2
A synthesis for exactly 3-edge-connected graphs A synthesis for exactly 3-edge-connected graphs =============================================== Carl Kingsford and Guillaume Marçais (2009) A synthesis for exactly 3-edge-connected graphs. *ArXiv Preprint ArXiv:0905.1053*. link: Preprint http://arxiv.org/abs/0905.1053 link: BibTeX bibtex/2009-kingsford-edgesynthesis.bib 01/01/2009 43 0.2
2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates 2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates ========================================================================== Carl Kingsford, Niranjan Nagarajan, and Steven L Salzberg (2009) 2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates. *Plos One* **4**(7):e6402 . link: Paper http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0006402 link: BibTeX bibtex/2009-kingsford-swineflu.bib 01/01/2009 55 0.3
A cooperative combinatorial particle swarm optimization algorithm for side-chain packing A cooperative combinatorial particle swarm optimization algorithm for side-chain packing ======================================================================================== Grecia Lapizco-Encinas, Carl Kingsford, and James Reggia (2009) A cooperative combinatorial particle swarm optimization algorithm for side-chain packing. *2009 IEEE Swarm Intelligence Symposium*, pages 22-29. link: Paper http://dx.doi.org/10.1109/SIS.2009.4937840 link: BibTeX bibtex/2009-lapizco-cooperative.bib 01/01/2009 57 0.3
Revealing biological modules via graph summarization Revealing biological modules via graph summarization ==================================================== Saket Navlakha, Michael C Schatz, and Carl Kingsford (2009) Revealing biological modules via graph summarization. *Journal of Computational Biology* **16**(2):253-264. link: Paper http://dx.doi.org/10.1089/cmb.2008.11TT link: BibTeX bibtex/2009-navlakha-graphsum.bib 01/01/2009 47 0.2
Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information ================================================================================================================ Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, and Carl Kingsford (2009) Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information. *Annual International Conference on Research in Computational Molecular Biology*, pages 400-417. link: Paper http://dx.doi.org/10.1007/978-3-642-02008-7 link: BibTeX bibtex/2009-navlakha-vicut-recomb.bib 01/01/2009 71 0.4
What are decision trees? What are decision trees? ======================== Carl Kingsford and Steven L Salzberg (2008) What are decision trees?. *Nature Biotechnology* **26**(9):1011-1013 . link: Paper http://www.nature.com/nbt/journal/v26/n9/abs/nbt0908-1011.html link: BibTeX bibtex/2008-kingsford-decisiontrees.bib 01/01/2008 42 0.2
Uncovering genomic reassortments among influenza strains by enumerating maximal bicliques Uncovering genomic reassortments among influenza strains by enumerating maximal bicliques ========================================================================================= Niranjan Nagarajan and Carl Kingsford (2008) Uncovering genomic reassortments among influenza strains by enumerating maximal bicliques. *2008 IEEE International Conference on Bioinformatics and Biomedicine*, pages 223-230. link: Paper http://doi.ieeecomputersociety.org/10.1109/BIBM.2008.78 link: BibTeX bibtex/2008-nagarajan-maxbiclique.bib 01/01/2008 55 0.3
A unified model explaining the offsets of overlapping and near-overlapping prokaryotic genes A unified model explaining the offsets of overlapping and near-overlapping prokaryotic genes ============================================================================================ Carl Kingsford, Arthur L Delcher, and Steven L Salzberg (2007) A unified model explaining the offsets of overlapping and near-overlapping prokaryotic genes. *Molecular Biology and Evolution* **24**:2091-2098 . link: Paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2429982/ link: BibTeX bibtex/2007-kingsford-geneoverlap.bib 01/01/2007 61 0.3
Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake ================================================================================================================================== Carleton L Kingsford, Kunmi Ayanbule, and Steven L Salzberg (2007) Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. *Genome Biology* **8**:R22. link: Paper http://genomebiology.com/2007/8/2/R22/abstract link: Code http://transterm.cbcb.umd.edu/ link: BibTeX bibtex/2007-kingsford-transtermhp.bib 01/01/2007 68 0.3
Genome analysis linking recent European and African influenza (H5N1) viruses Genome analysis linking recent European and African influenza (H5N1) viruses ============================================================================ Steven L Salzberg, Carl Kingsford, Giovanni Cattoli, David J Spiro, Daniel A Janies, Mona Mehrez Aly, Ian H Brown, Emmanuel Couacy-Hymann, Gian Mario De Mia, Do Huu Dung, and others (2007) Genome analysis linking recent European and African influenza (H5N1) viruses. *Emerging Infectious Diseases* **13**(5):713-718 . link: Paper http://www.cdc.gov/eid/content/13/5/713.htm link: BibTeX bibtex/2007-salzberg-h-5-n-1.bib 01/01/2007 81 0.4
A compact mathematical programming formulation for DNA motif finding A compact mathematical programming formulation for DNA motif finding ==================================================================== Carl Kingsford, Elena Zaslavsky, and Mona Singh (2006) A compact mathematical programming formulation for DNA motif finding. *Annual Symposium on Combinatorial Pattern Matching*, pages 233-245. link: Paper http://dx.doi.org/10.1007/11780441_22 link: BibTeX bibtex/2006-kingsford-motifilp-cpm.bib 01/01/2006 52 0.3
The inapproximability of side-chain positioning The inapproximability of side-chain positioning =============================================== Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) The inapproximability of side-chain positioning. Technical Report. Princeton University, Princeton. link: BibTeX bibtex/2004-chazelle-inapproximability.bib 01/01/2004 32 0.2
A semidefinite programming approach to side chain positioning with new rounding strategies A semidefinite programming approach to side chain positioning with new rounding strategies ========================================================================================== Bernard Chazelle, Carl Kingsford, and Mona Singh (2004) A semidefinite programming approach to side chain positioning with new rounding strategies. *INFORMS Journal on Computing* **16**:380-392. link: Paper http://www.extenza-eps.com/extenza/loadHTML?objectIDValue=50417&type=abstract link: BibTeX bibtex/2004-chazelle-scpsdp.bib 01/01/2004 59 0.3
Solving and analyzing side-chain positioning problems using linear and integer programming Solving and analyzing side-chain positioning problems using linear and integer programming ========================================================================================== Carleton L Kingsford, Bernard Chazelle, and Mona Singh (2004) Solving and analyzing side-chain positioning problems using linear and integer programming. *Bioinformatics* **21**:1028-1039. link: Paper http://dx.doi.org/doi:10.1093/bioinformatics/bti144 link: BibTeX bibtex/2004-kingsford-scpilp.bib 01/01/2004 55 0.3
The side-chain positioning problem: a semidefinite programming formulation with new rounding schemes The side-chain positioning problem: a semidefinite programming formulation with new rounding schemes ==================================================================================================== Bernard Chazelle, Carl Kingsford, and Mona Singh (2003) The side-chain positioning problem: a semidefinite programming formulation with new rounding schemes. *Proceedings of the Paris C. Kanellakis memorial workshop on Principles of computing & knowledge: Paris C. Kanellakis memorial workshop on the occasion of his 50th birthday*, pages 86-94. link: BibTeX bibtex/2003-chazelle-scpsdp-conference.bib 01/01/2003 70 0.3
Experimental construction of very large scale DNA databases with associative search capability Experimental construction of very large scale DNA databases with associative search capability ============================================================================================== John H Reif, Thomas H LaBean, Michael Pirrung, Vipul S Rana, Bo Guo, Carl Kingsford, and Gene S Wickham (2001) Experimental construction of very large scale DNA databases with associative search capability. *International Workshop on DNA-Based Computers*, pages 231-247. link: Paper https://link.springer.com/chapter/10.1007%2F3-540-48017-X_22 link: BibTeX bibtex/2001-reif-dnadb.bib 01/01/2001 73 0.4

View source