Search Kingsford Group

title	text	date	wordCount	minutes
Carl Kingsford	Carl Kingsford ============== carlk@cs.cmu.edu Herbert A. Simon Professor of Computer Science Ray and Stephanie Lane Computational Biology Department https://cbd.cmu.edu School of Computer Science https://www.cs.cmu.edu Carnegie Mellon University https://www.cmu.edu Affiliate Faculty, Machine Learning Department https://www.ml.cmu.edu Director, Center for Machine Learning in Health https://cmlh.org Director, Center for Innovation in Health https://www.cs.cmu.edu/cih/ Co-Director, Joint CMU-Univ Pittsburgh Ph.D. Program in Computational Biology https://compbio.cmu.edu CEO, Ocean Genomics https://oceangenomics.com Twitter/X https://twitter.com/ckingsford Google Scholar https://scholar.google.com/citations?user=V_cvqKcAAAAJ&hl=en Room 7719, Gates-Hillman Complex, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 My research group is interested in advancing computer science and machine learning methodologies to extract insight from biological data. We currently focus on the following classes of problems: - Genomics & genome assembly: RNA-seq expression quantification; genome assembly; large-scale sequence search, sketching, etc. This work is currently supported by NIH grant 1R01HG012470. It was previously supported by NIH grant 1R21HG006913, NSF grant CCF-1319998, a Data-Driven Investigator grant from the Gordon and Betty Moore Foundation, NIH grant R01GM122935, and an award from The Shurl and Kay Curci foundation. - Pan-genomics and genome graphs: Storing, searching, and using large collections of genomes and transcriptomes from many individuals. This work is supported by NSF grant III-2232121. - Automatically learning algorithms and wetlab protocols: Scheduling automated experimentation, hyperparameter optimization, autoML, and automated algorithm design. Supported by an award from Schmidt Sciences. 
Previous research interests include: - Chromatin structure and function: Algorithms for determining the spatial organization of eukaryotic genomes from Chromosome Conformation Capture data. Previously supported by NIH grant R01HG007104. - Viral evolution: Reassortment in the influenza genome. This work was supported by NIH grant 1R21AI085376. - Protein interactions and networks: Evolution of interactions; protein function prediction; clustering within networks; protein structure prediction. This work was supported by NSF grant EF-0849899 and by NSF grant CCF-1053918/CCF-1256087 (CAREER award). Disclosure: I am a co-founder of Ocean Genomics, Inc. https://oceangenomics.com	05/24/9999	356	1.8
DTMol: Pocket-based Molecular Docking using Diffusion Transformers	DTMol: Pocket-based Molecular Docking using Diffusion Transformers ================================================================== Haotian Teng, Ran Wang, Yihang Shen, Ye Yuan, Carl Kingsford (2025) DTMol: Pocket-based Molecular Docking using Diffusion Transformers. _bioRxiv_ 648103. Molecular docking --- predicting the binding structure of a small molecule ligand to a protein --- is a crucial task in computational chemistry and drug discovery. Traditional docking methods relying on scoring functions tend to be slow and inaccurate. Recent deep learning methods, especially diffusion-based generative models, have significantly improved the accuracy and computational efficiency of molecular docking. However, these methods still face challenges, particularly in the pocket-based docking setting, which involves docking a ligand when a protein pocket structure --- a cavity of the protein with potential ligand-binding capabilities --- is given. We introduce DTMol, a novel generative deep learning model designed to tackle the pocket-based molecular docking problem. Our model integrates a pretrained molecular representation framework with a new SE(3)-equivariant diffusion transformer architecture. The pretrained framework generates representations of both protein pockets and ligands, while the diffusion transformer effectively captures interaction information between them. Testing on the PDB-Bind dataset demonstrates that our method outperforms traditional docking methods and deep learning-based baselines. The efficacy of DTMol is further validated through a virtual screening task targeting Janus kinase 2 (PDB ID: 6BBV), followed by experimental validation of the top-ranked compounds via a protein kinase activity assay. link: DOI https://doi.org/10.1101/2025.04.13.648103 link: Preprint https://doi.org/10.1101/2025.04.13.648103 link: Code https://github.com/haotianteng/dtmol	04/21/2025	259	1.3
Biological databases in the age of generative artificial intelligence	Biological databases in the age of generative artificial intelligence ====================================================================== Mihai Pop, Teresa K Attwood, Judith A Blake et al. (2025) Biological databases in the age of generative artificial intelligence . Bioinformatics Advances, vbaf044 . Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases. link: DOI https://doi.org/10.1093/bioadv/vbaf044	03/20/2025	193	1.0
DNA Language Models for RNA Analyses	DNA Language Models for RNA Analyses ==================================== Shiyi Du, Litian Liang, Jiayi Li, and Carl Kingsford. (2025) DNA Language Models for RNA Analyses. OpenReview. We introduce a novel Adaptive Mixture of Codon Reformative Experts (CodonMoE) that can be incorporated into DNA genomic language models (gLMs) in order to adapt them for mRNA-based predictive tasks. We show that, by using this plug-and-play operator, DNA-based gLMs can achieve performance similar to that of RNA-trained models on mRNA tasks. We further show that recent, efficient sub-quadratic DNA-based state space model (SSM) architectures can be used with the CodonMoE to achieve parameter- and computationally-efficient predictions for mRNA tasks. Specifically, experimental results demonstrate that CodonMoE improves diverse DNA-based backbones by a large margin, with some models achieving comparable or superior performance to current state-of-the-art RNA-specific models across several downstream tasks, while reducing both time complexity and model parameters. Our results provide a path for focusing development efforts of gLMs on DNA models, which can then be adapted to mRNA tasks. Because DNA data is more prevalent than assembled mRNA data, and modeling efforts can focus on a single class of model, this is likely to foster improved DNA models for mRNA tasks at lower computational cost and is a significant step towards unifying genomic language modeling.	01/01/2025	247	1.2
A synthesis for exactly 3-edge-connected graphs	A synthesis for exactly 3-edge-connected graphs =============================================== Carl Kingsford and Guillaume Marçais (2025) A synthesis for exactly 3-edge-connected graphs. _Discrete Applied Mathematics_ 368:18-29. A multigraph is uniformly 3-edge-connected if there are exactly 3 edge-disjoint paths between any pair of vertices. For example, a uniformly 3-edge-connected graph is obtained from a 3-edge-connected graph by collapsing the nodes connected by more than 3 edge-disjoint paths into supernodes. We characterize the class of uniformly 3-edge-connected graphs, giving a synthesis involving two operations by which every uniformly 3-edge-connected multigraph can be generated. Slightly modified syntheses give the planar uniformly 3-edge-connected graphs and the uniformly 3-edge-connected graphs with the fewest possible edges, generalizing the well-known Harary graphs. In proving the correctness of the synthesis, we also show the existence of a particular type of induced, non-separating cycle in near 3-regular graphs, which is of interest in its own right. link: Preprint http://arxiv.org/abs/0905.1053 link: Paper https://www.sciencedirect.com/science/article/abs/pii/S0166218X25000630 link: BibTeX bibtex/2025-kingsford-edgesynthesis.bib	01/01/2025	196	1.0
Haotian Teng successfully defends his dissertation	Haotian Teng successfully defends his dissertation ================================================== August 26, 2024 Haotian Teng has successfully defended his Ph.D. dissertation. Teng was co-advised by Carl Kingsford and Ziv Bar-Joseph. Congratulations Dr. Teng! Haotian Teng (2024) Haotian Teng successfully defends his dissertation. Ph.D. Thesis Tech Report. Many fundamental biological tasks require unsupervised learning where groundtruth labels are unavailable, but shallow unsupervised machine learning methods have poor performance on these tasks due to the complexity of the problem. With their strong representation power, deep learning models have been widely applied to solve challenging tasks; however, they usually require large amounts of labeled data. To take advantage of the strong representation power of deep learning while applying it to unsupervised tasks, we developed several hybrid models that combine deep neural networks and unsupervised machine learning models. We used these models to improve performance on unsupervised biological tasks, including cell type clustering, basecalling, and molecular docking, demonstrating how a hybrid model can be used in solving spatial tasks (cell-type clustering), temporal tasks (basecalling), and spatial-temporal tasks (molecular docking). First, we present an unsupervised cell type clustering model for recently developed single-molecule, spatially resolved transcriptomics data, where a deep neural network (NN) encoder is used to generate low-dimensional, Gaussian-distributed gene embeddings, which are then combined with spatial relationships using a Gaussian-Multinomial Mixture Model developed by us to predict cell type clustering. The second problem we try to tackle is to call m6A methylated bases in RNA generated from long-read sequencing. 
m6A modification plays essential roles in regulating gene expression, but an efficient way to detect it systematically is lacking. The long-read sequencing from Oxford Nanopore Technologies has been shown to be sensitive to post-transcriptional modification, but an m6A-sensitive basecaller for directly detecting this subtle sequencing signal has not yet been developed. We used a CNN-RNN (Convolutional-Recurrent Neural Network) model previously developed by us for canonical basecalling to train a Non-homogeneous HMM (NHMM) whose transition matrix is conditioned on the deep NN output. Using hybrid synthetic m6A methylation data sampled from the NHMM, we were able to train a NN basecaller to call m6A bases. We applied our method to call the methylome in yeast and human RNA without the need for knock-out comparison data. For the third application, we developed a deep generative model with an SE(3)-equivariant diffusion transformer to address pocket-based molecular docking, where the 3D structure of the ligand is to be predicted given the protein pocket. We applied our model to a virtual screening task to select effective JAK2 inhibitors, identifying 13 candidate compounds with high affinity scores confirmed by wet lab assays from a total of 9,137 drugs, two of which are new molecules that have never been reported before.	08/26/2024	470	2.4
Journal version of "How much data is sufficient to learn high-performing algorithms?"	Journal version of "How much data is sufficient to learn high-performing algorithms?" ===================================================================================== Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1--58 . The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms" has been accepted to the Journal of the ACM. The STOC 2021 version can be found here. https://dl.acm.org/doi/10.1145/3406325.3451036 See also this post. https://kingsfordlab.cbd.cmu.edu/2019-balcan-sampcomplex.html link: DOI https://doi.org/10.1145/367627 link: Preprint https://arxiv.org/abs/1908.02894	06/26/2024	122	0.6
Voices: How has the AI boom impacted algorithmic biology?	Voices: How has the AI boom impacted algorithmic biology? ========================================================= Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology?. _Cell Systems_ 15(6): P483-487. The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm. Many traditional algorithmic tasks, such as genomic sequence alignment or transcript assembly, are implemented as highly parameterized algorithms, where the settings of these parameters can significantly affect the accuracy of the output. These can be hard to set by hand, requiring expertise, time, and a way to assess accuracy. This work can be avoided, while simultaneously increasing reproducibility and accuracy, through new AI approaches that use large datasets to train AI models to predict input-specific parameters for traditional, hand-designed algorithms. Such systems are especially useful when analyzing large, heterogeneous collections of samples where hand selection of optimal parameters is not feasible. Future work in this area involves deeper co-design of parameterized algorithms and AI systems to enable AI-driven optimization, possibly explicitly supporting the selection from among various large-scale algorithmic changes. An additional challenge is to codify the definition of the desired output to be able to optimize parameters for biological insight and utility. 
This is related to a third challenge, which is avoiding overfitting: when selecting from a large parameter space, trivial solutions that technically optimize the quality of the output but that are not useful can be obtained (for example, if the optimization metric is number of fragments aligned, selecting parameters that simply align all fragments poorly would satisfy the AI but not the biologist). link: DOI https://doi.org/10.1016/j.cels.2024.05.008	06/20/2024	331	1.7
Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants	Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants ================================================================================================== Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol. The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable contradicts the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/. 
link: DOI https://doi.org/10.1186/s13015-024-00262-6 link: Preprint https://arxiv.org/abs/2305.10577 link: Paper https://link.springer.com/article/10.1186/s13015-024-00262-6 link: PDF pdf/2024-qiu_shen-gtedjournal.pdf link: Code https://github.com/Kingsford-Group/gtednewilp/ link: BibTeX bibtex/2024-qiu_shen-gtedjournal.bib	04/28/2024	305	1.5
11% of RECOMB24 papers co-authored by current or former Kingsford group members	11% of RECOMB24 papers co-authored by current or former Kingsford group members =============================================================================== April 24, 2024 Six out of 57 accepted RECOMB24 papers are authored by current or former Kingsford group members. These include 2 papers authored by current Ph.D. students, and 4 authored by former trainees of the group. - Improving Hi-C contact matrices using genome graphs. Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang and Carl Kingsford - A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. Shane Elder, Quang Minh Hoang, Mohsen Ferdosi and Carl Kingsford - Inferring allele-specific copy number aberrations and tumor phylogeography using spatially resolved transcriptomics. Cong Ma, Metin Balaban, Clara Liu, Siqi Chen, Li Ding and Ben Raphael - Mapping the topography of spatial gene expression with interpretable deep learning. Uthsav Chitra, Brian Arnold, Hirak Sarkar, Cong Ma, Sereno Lopez-Darwin, Kohei Sanno and Ben Raphael - Meta-colored compacted de Bruijn graphs. Giulio Ermanno Pibiri, Jason Fan and Robert Patro - Accurate Assembly of Circular RNAs with TERRACE. Tasfia Zahin, Qian Shi, Xiaofei Carl Zang and Mingfu Shao Additionally, five papers, including some of the above, have an author currently affiliated with CMU. The acceptance rate was 16.5%.	04/24/2024	201	1.0
Yihang Shen successfully defends his dissertation	Yihang Shen successfully defends his dissertation ================================================= April 10, 2024 Yihang Shen has successfully defended his Ph.D. Congratulations Dr. Shen! Yihang Shen (2024) Yihang Shen successfully defends his dissertation. Ph.D. Thesis (CMU-CB-24-100). Hyper-parameters play a crucial role in the efficacy of machine learning and computational biology tools. Their optimal selection profoundly impacts tool performance, yet manual tuning of these hyper-parameters can be a laborious process, demanding extensive domain knowledge. Therefore, the development of algorithms for automatic hyper-parameter tuning is important. While numerous strategies for hyper-parameter tuning in machine learning exist, certain challenges remain inadequately addressed. These include tuning hyper-parameters (1) in contexts where the search space exhibits unique characteristics, such as high dimensionality; and (2) for computational biology tools, where optimal settings are closely tied to the specific biological sample being analyzed. In response to these challenges, this dissertation focuses on the development of novel algorithms designed for the automatic tuning of hyper-parameters across various tool types. In the first part, we focus on developing new Bayesian Optimization methods for hyper-parameter tuning in high-dimensional search spaces. Bayesian Optimization is a machine learning method that is widely used in various scenarios, including the tuning of hyper-parameters. Yet, its application in high-dimensional search spaces presents a significant challenge that remains to be fully addressed. To overcome this, we develop a new high-dimensional Bayesian Optimization framework based on the concept of variable selection and show that the new method is more computationally efficient than previous high-dimensional Bayesian Optimization methods. 
In the second part, we focus on developing hyper-parameter tuning algorithms for transcript assemblers. Transcript assemblers are tools for reconstructing expressed transcripts from the reads in a given RNA-seq sample. Given that these tools have many tunable hyper-parameters, and their optimal configurations greatly depend on the characteristics of the input sample, it is crucial to develop automatic tuning methods that adapt to different inputs. We develop the first adaptive, sample-specific hyper-parameter tuning system for transcript assemblers. This innovation marks an important advancement towards more precise transcript assembly, which in turn will enhance downstream RNA-seq analyses such as transcript quantification. In high-throughput sequencing biological data analysis, the initial step is to align reads to a linear reference genome to determine their genomic locations. Recognizing that genetic variations differ among biological samples, it is crucial to use a sample-specific reference genome rather than a default one. Therefore, automatically deducing the sample-specific reference genome directly from the sample data becomes an important problem. It is a unique hyper-parameter tuning problem, where the reference genome represents the hyper-parameter and the search space encompasses various potential genomes. In the last part, we focus on developing algorithms to infer genomes from Hi-C, a distinct type of high-throughput sequencing data providing insights into the spatial arrangement of chromosomes. We show that using an inferred genome improves downstream Hi-C analyses, thereby contributing to a more profound understanding of chromosomal organization and function.	04/10/2024	514	2.6
Carl Kingsford named ISCB Fellow	Carl Kingsford named ISCB Fellow ================================ March 8, 2024 Carl Kingsford, Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department, has been elected as a Fellow of the International Society for Computational Biology (ISCB). The ISCB notes that Carl has been chosen for this honor because he “is a trailblazer in computational molecular biology, showcasing sustained innovation in scalable algorithmic approaches.” Carl’s research is focused on developing new, efficient algorithms and AI methods for extracting knowledge from large biological data sets, particularly high-throughput DNA and RNA sequencing data. He has worked on algorithms for accurately quantifying gene expression, identifying compact regions of chromatin, and large-scale sequence search. Carl’s group continues to push the boundaries of how computer science can drive scientific discovery, including recently developing new pan-genomic analysis algorithms using genome graphs, using reinforcement learning to optimize experimental protocols to drive automated experimentation, and creating new meta-learning techniques for adapting deep neural networks to new tasks with limited data. He is also director of CMU’s Center for Machine Learning and Health (CMLH). The ISCB Fellows Program, introduced in 2009, is a prestigious recognition within the field of computational biology. It honors members who have distinguished themselves through outstanding contributions to the field and is awarded to only ½ of a percent of the previous year’s ISCB membership. The new Fellows will be introduced during this year’s ISMB conference in July 2024. For more information on this year’s fellows, visit ISCB’s website. https://www.iscb.org/iscb-news-items/5232-march-8-2024-iscb-congratulates-and-introduces-the-2024-class-of-fellows	03/08/2024	275	1.4
k-nonical space: sketching with reverse complements	k-nonical space: sketching with reverse complements =================================================== Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. _Bioinformatics_. Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (1) a new procedure that adapts existing sketching methods to k-nonical space and (2) an optimization procedure to directly design new sketching methods for k-nonical space. link: DOI https://doi.org/10.1101/2024.01.25.577301 link: Preprint https://www.biorxiv.org/content/10.1101/2024.01.25.577301v1 link: Code https://github.com/Kingsford-Group/mdsscope	01/27/2024	275	1.4
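The folding step described in the abstract above can be illustrated with a minimal sketch. It assumes the common convention of picking the lexicographically smaller of a k-mer and its reverse complement as the canonical form; the abstract does not say which convention the paper uses, and the function names here are hypothetical rather than taken from the mdsscope code.

```python
# Illustrative sketch only: the lexicographic-minimum convention and all
# function names are assumptions, not taken from the paper or its code.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """List the canonical form of every k-mer in seq, left to right."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Both strands of the same molecule yield the same canonical k-mers,
    # in opposite order.
    forward = canonical_kmers("ACGTT", 3)
    backward = canonical_kmers(reverse_complement("ACGTT"), 3)
    print(forward, backward[::-1])
```

Under this convention a sequence and its reverse complement produce the same canonical k-mers in reversed order, which is the strand-independence property that sketching methods rely on when they work in k-nonical space.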
Adaptive, sample-specific parameter selection for more accurate transcript assembly	Adaptive, sample-specific parameter selection for more accurate transcript assembly =================================================================================== Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv. Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because RNA-seq samples are heterogeneous, a single default setting or a small fixed set of parameter candidates can support good transcript assembly performance only on average and is often suboptimal for individual samples. Manually tuning parameters for each sample is time consuming and requires specialized experience. Therefore, developing an automated system that can advise good parameter settings for individual samples becomes an important problem. Results: Using Bayesian optimization and contrastive learning, we develop a new automated parameter advising system for transcript assembly that can generate sets of sample-specific good parameter candidates. Our framework achieves efficient sample-specific parameter advising by learning parameter knowledge from a representative set of existing RNA-seq samples and transferring the knowledge to unseen samples. We use Scallop and StringTie, two well-known transcript assemblers, to test our framework on two collections of RNA-seq samples. Results show that our new parameter advising system significantly outperforms the previous advising method in each dataset and each transcript assembler. link: DOI https://doi.org/10.1101/2024.01.25.577290	01/27/2024	238	1.2
Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation	Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation ====================================================================== Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024. We tackle the problem of meta-learning across heterogeneous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity. To overcome this weakness, we propose a new heterogeneous meta-learning strategy that efficiently captures the multi-modality of the task distribution via modulating the routing between convolution channels in the network, instead of directly modulating the network weights. This new mechanism can be cast as a permutation learning problem. We further introduce a novel neural permutation layer based on the classical Benes routing network, which has sub-quadratic parameter complexity in the total number of channels, as compared to the quadratic complexity of the state-of-the-art Gumbel-Sinkhorn layer. We demonstrate our approach on various multi-modal meta-learning benchmarks, showing that our framework outperforms previous methods in both generalization accuracy and convergence speed. link: Paper https://openreview.net/forum?id=QiJuMJl0QS	01/16/2024	266	1.3
A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements	A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements =========================================================================================================== C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford. (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024. The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors. Traditional algorithms cope with this complexity by exploring a much larger solution space, leading to exponential blowup in terms of both time and space. To alleviate both issues, we propose a novel alternating optimization algorithm that is able to scale to large, uncertain distance sets with as many as 100,000 points. This algorithm is space- and time-efficient, with each step running in $O(m \log(m))$ time and requiring only $O(\sqrt{m})$ work space for a distance set of size $m$. 
Evaluations of this approach on synthetic and partial digest data showcase improved accuracy and scalability in the presence of uncertain, duplicated, and missing distances. Our implementation of the algorithm is available here. link: DOI https://doi.org/10.1101/2024.02.15.580520 link: Preprint https://www.biorxiv.org/content/10.1101/2024.02.15.580520v1 link: Code https://github.com/Kingsford-Group/turnpikesolvermm	01/12/2024	308	1.5
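To make the Turnpike input concrete, the following minimal sketch computes the unordered multiset of pairwise distances for a linear point set. It illustrates the problem statement only, not the alternating optimization algorithm from the paper; the function name is hypothetical. It also shows why reconstruction is ambiguous: a point set and its mirror image, and even some non-mirrored sets, produce the same distances.

```python
from collections import Counter
from itertools import combinations


def turnpike_distances(points):
    """Multiset of pairwise distances for points on a line.

    For n points this holds m = n * (n - 1) / 2 distances, with no record
    of which pair of points produced each one.
    """
    return Counter(abs(a - b) for a, b in combinations(points, 2))


# A point set and its mirror image are indistinguishable from distances alone.
original = [0, 2, 5]
mirrored = [5 - x for x in original]          # [5, 3, 0]
print(turnpike_distances(original) == turnpike_distances(mirrored))

# Two different (non-mirrored) point sets sharing one distance multiset.
set_a = [0, 1, 4, 10, 12, 17]
set_b = [0, 1, 8, 11, 13, 17]
print(turnpike_distances(set_a) == turnpike_distances(set_b))
```

A solver therefore recovers the points only up to such equivalences, and measurement uncertainty enlarges the set of consistent solutions further.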
Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework	Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework ================================================================================================== Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. _Genome Research_ 34:1987-1999. Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps. First, we generate data with more diverse modification combinations through in silico cross-linking. Second, we use this dataset to train an end-to-end neural network basecaller followed by fine-tuning on immunoprecipitation-based experimental data with label-smoothing. The trained neural network basecaller outperforms existing methylation detection methods on both read-level and site-level prediction scores. Xron is a standalone, end-to-end m6A-distinguishing basecaller capable of detecting methylated bases directly from raw sequencing signals, enabling de novo methylome assembly. link: DOI https://doi.org/10.1101/2024.01.06.574484 link: Preprint https://biorxiv.org/cgi/content/short/2024.01.06.574484v1 link: Paper https://genome.cshlp.org/content/34/11/1987.full.pdf link: Code https://github.com/haotianteng/xron	01/08/2024	244	1.2
Improving Hi-C contact matrices using genome graphs	Improving Hi-C contact matrices using genome graphs =================================================== Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024. Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of the Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes. However, the linear reference genome does not incorporate genetic variation information, which can lead to incorrect read alignments, especially when analyzing samples with substantial genomic differences from the reference such as cancer samples. Using genome graphs as the reference facilitates more accurate mapping of reads; however, new algorithms are required for inferring linear genomes from Hi-C reads mapped on genome graphs and constructing corresponding Hi-C contact matrices, which is a prerequisite for the subsequent steps of the Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show the hardness of solving this problem, and introduce a novel heuristic algorithm specifically tailored to this problem. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred from our method lead to the creation of improved Hi-C contact matrices. These enhanced matrices show a reduction in erroneous patterns caused by structural variations and are more effective in accurately capturing the structures of topologically associated domains. 
link: Preprint https://doi.org/10.1101/2023.11.08.566275 link: PDF pdf/2024-shen-graphhic.pdf link: Code https://github.com/Kingsford-Group/graphhic	11/13/2023	297	1.5
Sketching methods with small window guarantee using minimum decycling sets	Sketching methods with small window guarantee using minimum decycling sets ========================================================================== Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology. Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. In particular, the window guarantee ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee corresponds to a decycling set, also known as an unavoidable set of k-mers. Any long enough sequence must contain a k-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set defines a sketching method by selecting the k-mers from the set. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their small size. Only two algorithms, by Mykkeltveit and Champarnaud, are known to generate two particular MDSs, although there is a vast number of alternative MDSs. 
We provide a simple method that allows one to explore the space of MDSs and to find sets optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. link: DOI https://doi.org/10.48550/arXiv.2311.03592 link: Preprint https://arxiv.org/abs/2311.03592 link: Paper https://doi.org/10.1089/cmb.2024.0544	11/09/2023	342	1.7
Minh Hoang successfully defends his dissertation	Minh Hoang successfully defends his dissertation ================================================ [Image Omitted] October 16, 2023 Minh Hoang has successfully defended his Ph.D. dissertation, which is titled “Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology.” He will join Princeton University as a postdoc. Congratulations Dr. Hoang! Minh Hoang (2023) Minh Hoang successfully defends his dissertation. Ph.D. Thesis. Configuration tuning is an essential practice to achieve good performance with many computational methods. However, configuring complex and discrete algorithms often requires significant trial-and-error effort due to a lack of automated solutions. In large-scale systems where computational tasks are numerous and constantly changing in specificity, the repetitive cost of manual tuning becomes a major bottleneck that hinders scalability. Moreover, the absence of a systematic approach to configure deployment settings makes it challenging to replicate the obtained results in different deploying conditions. To address these problems, this thesis focuses on developing new data-driven automated algorithm design (AAD) frameworks in several classical and multi-task settings. Specifically, in the classical configuration tuning setting, we address the problems of kernel selection for Bayesian methods, and minimizer construction for biological sequence sketching. In the multi-task scenario, we address the problems of privacy-preserving neural architecture search for multiple clients, and meta-learning for parameter optimization in a heterogeneous task stream. In all of these problems, the variables to be optimized often have underlying discrete structures such as trees, graphs or permutations. Our contribution is a suite of reformulation techniques that result in efficient and accurate tuning methods for these configuration domains. 
Finally, we demonstrate the performance of our methods on practical scenarios and show that they have significantly outperformed state-of-the-art benchmarks. link: Paper http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/home/anon/home/ftp/cbd/abstracts/20-101.html link: PDF pdf/2023-news-minhdefends.pdf	10/16/2023	311	1.6
Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC	Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC ============================================================================================= Hossein Asghari, Ehsan Haghshenas, Roby Thomas, Eric Schultz, Rob Patro, Stan Skrzypczak, and Carl Kingsford (2023) Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC. Cancer Res (2023) 83 (7_Supplement): 1400. link: DOI https://doi.org/10.1158/1538-7445.AM2023-1400 link: Paper https://doi.org/10.1158/1538-7445.AM2023-1400	06/21/2023	70	0.3
Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer	Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer ======================================================================================================================================================================= AC Chiang, H Asghari, K Ashley, S Gettinger, S Goldberg, R Herbst, FH Wilson, BR Newton, MK Cohenuram, KD Sabbath, AS Talsania, AV Russo, E Schultz, S Skrzypczak, C Kingsford, and KS Schalper (2023) Molecular predictors and immunomodulatory role of dual checkpoint inhibitor blockade using ipilimumab/nivolumab in patients with extensive stage small cell lung cancer. J Clin Oncol 41, 2023 (suppl 16; abstr 8597). link: DOI https://doi.org/10.1200/JCO.2023.41.16_suppl.8597 link: Paper https://doi.org/10.1200/JCO.2023.41.16_suppl.8597	06/21/2023	111	0.6
Reinforcement Learning for Robotic Liquid Handler Planning	Reinforcement Learning for Robotic Liquid Handler Planning ========================================================== Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford (2023) Reinforcement Learning for Robotic Liquid Handler Planning. In Proceedings of WABI 2023. Robotic liquid handlers play a crucial role in automating laboratory tasks such as sample preparation, high-throughput screening, and assay development. Manually designing protocols takes significant effort, and can result in inefficient protocols and involve human error. We investigate the application of reinforcement learning to automate the protocol design process resulting in reduced human labor and further automation in liquid handling. We develop a reinforcement learning agent that can automatically output the step-by-step protocol based on the initial state of the deck, reagent types and volumes, and the desired state of the reagents after the protocol is finished. We show that finding the optimal protocol for solving a liquid handler instance is NP-complete, and we present a reinforcement learning algorithm that can solve the planning problem practically for cases with a deck of up to 20 × 20 wells and four different types of reagents. We design and implement an actor-critic approach, and we train our agent using the Impala algorithm. Our findings demonstrate that reinforcement learning can be used to automatically program liquid handler robotic arms, enabling more precise and efficient planning for the liquid handler and laboratory automation. link: Paper https://doi.org/10.4230/LIPIcs.WABI.2023.23 link: PDF pdf/2023-ferdosi-liquidhandler.pdf	06/21/2023	238	1.2
Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme	Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme ========================================================================================== Minh Hoang, Guillaume Marçais, and Carl Kingsford (2024) Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. Journal of Computational Biology 31(1):2-20. Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. 
Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used. link: DOI https://doi.org/10.1089/cmb.2023.0212 link: Preprint https://doi.org/10.1101/2022.10.18.512430 link: Code https://github.com/Kingsford-Group/maskedminimizer	06/21/2023	312	1.6
Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection	Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection ======================================================================================= Yihang Shen and Carl Kingsford (2023) Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection. In Proceedings of AutoML 2023. Bayesian Optimization (BO) is a method for globally optimizing black-box functions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming. Alternative strategies for high-dimensional BO that are based on the idea of embedding the high-dimensional space to one with low dimensions are sensitive to the choice of the embedding dimension, which needs to be pre-specified. We develop a new computationally efficient high-dimensional BO method that exploits variable selection. Our method is able to automatically learn axis-aligned sub-spaces, i.e. spaces containing selected variables, without the demand of any pre-specified hyperparameters. We analyze the computational complexity of our algorithm. We empirically show the efficacy of our method on several synthetic and real problems. link: PDF pdf/2023-shen-vsbo.pdf	06/21/2023	183	0.9
Creating and Using Minimizer Sketches in Computational Genomics	Creating and Using Minimizer Sketches in Computational Genomics =============================================================== Hongyu Zheng, Guillaume Marçais, and Carl Kingsford (2023) Creating and Using Minimizer Sketches in Computational Genomics. J Comp. Biol. (2023). Processing large datasets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared to the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future. link: PDF pdf/2023-zheng-minreview.pdf	06/21/2023	188	0.9
Carl chairs and CMU hosts an NSF-NIH Joint Workshop on Emerging AI in Biology	Carl chairs and CMU hosts an NSF-NIH Joint Workshop on Emerging AI in Biology ============================================================================= June 10, 2023 In order to survey the current frontier of the interface between AI methodology and biology and to chart future directions and challenges, we held an "NSF-NIH Joint Workshop on Emerging AI in Biology" in June 2023 that gathered approximately 40 experts on the intersection of research in AI and biology. New techniques in artificial intelligence (AI) are rapidly being developed, extended and applied to challenging problems in biology. At the same time, as new assays, new data collection efforts, and greater understanding are developed in biology, the scope of problems that are amenable to AI approaches is growing. The workshop included scientific presentations about cutting-edge applications and computational methodology relating to the use of AI in biological sciences, many of which were subsequently publicly posted to the web. The workshop also included significant discussion time, during which participants collaborated on drafting a report to the NIH and NSF that communicated some of the opportunities and challenges in applying AI in biology. These insights were shaped into a final report that was delivered to the NSF and NIH. https://www.cs.cmu.edu/cih/events/cih-event-ai-biology-2023 Videos of some of the talks are available here. https://www.youtube.com/playlist?list=PLzjJEuCotCQxpyY4gna09l1rPiramjNrc	06/10/2023	225	1.1
Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants	Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants ================================================================================================== Yutong Qiu, Yihang Shen, and Carl Kingsford (2023) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. In Proceedings of WABI 2023. The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs will always yield optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics that estimate GTED efficiently. 
link: DOI 10.4230/LIPIcs.WABI.2023.11 link: Preprint https://arxiv.org/abs/2305.10577 link: Paper https://doi.org/10.4230/LIPIcs.WABI.2023.11 link: PDF pdf/2023-qui_shen-gted.pdf link: Code https://github.com/Kingsford-Group/gtednewilp/ link: BibTeX bibtex/2023-qiu_shen-gted.bib	05/19/2023	290	1.4
Yutong Qiu successfully defends her dissertation	Yutong Qiu successfully defends her dissertation ================================================ May 5, 2023 Yutong Qiu, a CPCB Ph.D. student, successfully defended her dissertation titled “Algorithmic Foundations of Genome Graph Construction and Comparison.” She will join Illumina, Inc. Congratulations, Dr. Qiu! Yutong Qiu (2023) Yutong Qiu successfully defends her dissertation. Ph.D. Thesis CMU-CB-23-101. Pangenomic studies have enabled a more accurate depiction of the human genome landscape. Genome graphs are suitable data structures for analyzing collections of genomes due to their efficiency and flexibility of encoding shared and unique substrings from the population of encoded genomes. Novel challenges arise when genome graphs are applied to thousands of genomes because current genome graph models are insufficient in addressing the questions: (1) How can genome graphs be constructed efficiently that optimize the storage space? (2) How can genome graphs be used to more accurately and more efficiently compare heterogeneous sequences such as cancer genomes or immune repertoires? To answer these questions, we lay algorithmic foundations for genome graph construction and comparison. The size of a genome graph is crucial to both efficient storage and analysis. However, few genome graph construction methods directly optimize the graph size. By drawing connections to data compression, we develop an algorithmic framework for genome graph construction that prioritizes genome graph size and show that the new framework produces small genome graphs efficiently compared to other genome graph schemes. Our compression-based framework not only removes the dependency on hyper-parameters but also opens up the potential for adapting established compression algorithms to construct better genome graphs. 
In many scenarios, such as immune repertoire analysis, we need to quantify the similarity between heterogeneous sets of genomic strings, but the complete strings are unknown due to limitations in sequencing technology. The distance between genome graphs can be used to estimate the difference between these strings. One important metric is defined as the graph traversal edit distance (GTED). We revisit the complexity of and the previously proposed algorithms for GTED. We prove that GTED is NP-complete and show that the previously proposed algorithms compute a lower bound of GTED. In addition, we propose two correct ILP formulations of GTED and characterize the relationship between GTED and the previous lower bound ILPs. We evaluate the empirical efficiency of solving GTED and its lower bound ILP and show that solving GTED exactly with ILPs is currently not practical on larger genomes. Genome graphs are often highly expressive and represent more than one string set, and thus the distance between two graphs using standard graph distances does not always model the actual edit distance between true string sets. To quantify this discrepancy, we formally define genome graph expressiveness as its diameter and use it to bound the deviation of the genome graph distance from string set distances. We produce a more accurate distance measure between (unseen) collections of strings encoded as genome graphs. The new distance measure and its deviation from string set distances are evaluated on simulated human T-cell repertoire sequences and Hepatitis B virus genomes. link: PDF pdf/2023-news-yutongdefends.pdf	05/05/2023	515	2.6
Laura Tung successfully defends her Ph.D. thesis	Laura Tung successfully defends her Ph.D. thesis ================================================ October 28, 2022 Laura Tung successfully defended her Ph.D. thesis. She will join Guardant Health, Inc. Congratulations, Dr. Tung! [Image Omitted] Laura H. Tung (2022) Laura Tung successfully defends her Ph.D. thesis. Ph.D. (CMU-CB-22-108). Studying the transcriptome is crucial to understanding functional elements of the genome and elucidating biological pathways associated with disease. High-throughput sequencing such as RNA-seq has become a powerful tool for transcriptome analysis. Due to limited read lengths, identifying full-length transcripts from short reads remains challenging. As third-generation sequencing becomes increasingly important, single-molecule long reads have been used to improve transcriptome analyses such as isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis. This drives a need for transcript assembly on long reads. We developed a reference-based long-read transcript assembler, Scallop-LR, aiming to discover more novel isoforms. Analyzing a considerable number of RNA-seq long-read samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate that Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read transcript assembly by Scallop-LR can reveal a more complete human transcriptome. Nanopore sequencing has become a leading choice for long-read RNA-seq. However, Nanopore long reads have high error rates. For many non-model organisms without a high-quality reference, de novo (reference-free) error correction methods designed for RNA-seq long reads are needed. 
We developed a novel, error-profile-aware correction method, deepCorrRNA, for correcting RNA-seq long reads de novo using deep learning. deepCorrRNA combines a graph-based method and a deep neural network that incorporates the error profile related information systematically. We show that ML-based deepCorrRNA achieves comparable error-rate reductions to state-of-the-art ONT-specific isONcorrect. Across different organisms, deepCorrRNA demonstrates robust de novo error correction capability, which can benefit the transcriptome studies of non-model organisms. deepCorrRNA’s method in principle is generalizable and may be applied to different technologies. To accelerate transcriptome analyses, RNA-seq analysis tools require comprehensive evaluation and parameter optimization. While the number of RNA-seq samples grows enormously at large sequence databases, most RNA-seq analysis tools are evaluated on limited RNA-seq samples. This leads to a need to select a representative subset from RNA-seq samples at large databases, which effectively summarizes the original collection of RNA-seq samples. We developed a novel hierarchical representative set selection method to tackle the memory and runtime challenges in k-mer counting approaches for RNA-seq samples in a large database. We demonstrate that hierarchical representative set selection achieves summarization quality close to direct representative set selection, while largely reducing the runtime and memory usage, and substantially outperforms random sampling on the entire SRA set of human RNA-seq samples. The algorithms, methods, and analysis we have developed can be used to improve transcriptome analyses and further our understanding of complex transcriptomes. link: PDF pdf/2022-news-tungphd.pdf	10/28/2022	529	2.6
The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance	The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance ================================================================================================================== Yutong Qiu and Carl Kingsford (2022) The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. In Proceedings of ISMB 2022 in Bioinformatics 38(Supplement_1):i404-412 (2022). Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. 
On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. link: DOI 10.1093/bioinformatics/btac264 link: Paper https://doi.org/10.1093/bioinformatics/btac264 link: BibTeX bibtex/2022-qiu-expressive.bib	08/24/2022	333	1.7
Hongyu Zheng successfully defends his Ph.D. dissertation	Hongyu Zheng successfully defends his Ph.D. dissertation ======================================================== February 24, 2022 Hongyu Zheng successfully defended his Ph.D. thesis. He will joint Princeton University as a postdoc. Congradulations, Dr. Zheng! Hongyu Zheng (2022) Hongyu Zheng successfully defends his Ph.D. dissertation. Ph.D. Dissertation. Sequence sketch methods generate compact fingerprints of large string sets for efficient indexing and searching. Minimizers are one of such sketching methods, sampling k-mers from a string by selecting the minimal k-mer from each sliding window with a predetermined ordering. Minimizers sketches preserve information to detect sufficiently long substring matches. This favorable property of minimizers, and its ease of implementation, lead to wide adoption in a large number of software and pipelines, including read mappers, genome assemblers, sequence databases and more. Despite the method’s popularity, many theoretical and practical questions regarding its performance remain open. This is especially true regarding the density of minimizers, a simple yet powerful metric for minimizer performance. We have neither understanding of density growth under various conditions, nor principled approaches to construct minimizers with provably low density. This dissertation attacks the lack of knowledge in minimizer sketches from multiple fronts. In the first part, we are primarily concerned with asymptotic behavior for the density of minimizer sketches. Minimizers are parameterized by the k-mer length k, a window length w and an order on the k-mers. Using a number of new techniques, we are able to provide a complete picture of how the density of optimal minimizer grows with its parameters w and k. We also derive structural lemmas for universal hitting sets and local schemes, two highly related concepts for minimizer design. 
Together, these results serve as building blocks for future theoretical advances. The next two parts are focused on constructing low-density minimizers in two scenarios. In the second part, we propose the Miniception, the first construction of minimizers that provably achieves better densities than a random minimizer in practical configurations of w and k. This method is also simple to implement and requires minimal overhead compared to existing implementations of random minimizers. In the third and final part, we consider optimization of minimizer density on a given reference sequence. The most common use case is when the given sequence is the reference genome or transcriptome. We propose the concept of polar sets, which itself is a complementary concept of universal hitting sets and similarly can be used to construct minimizers. Using polar sets, we propose efficient algorithms to directly optimize the density of minimizers on a reference sequence up to some error. Experiments show that both the Miniception and the polar set algorithms can reliably outperform existing methods for constructing low-density minimizers, without a reference and with a reference, respectively. link: PDF pdf/2022-news-zhengphd.pdf	02/24/2022	467	2.3
DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes	DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes ============================================================================================ Minh Hoang, Hongyu Zheng, and Carl Kingsford (2022) DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes. In Proceedings of RECOMB 2022. Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite their widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well in expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of k-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network reparameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, DeepMinimizer, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences. link: Code https://github.com/Kingsford-Group/deepminimizer	12/21/2021	194	1.0
Personalized neural architecture search for federated learning.	Personalized neural architecture search for federated learning. =============================================================== Minh Hoang and Carl Kingsford (2021) Personalized neural architecture search for federated learning. In NeurIPS 2021 Workshop on New Frontiers in Federated Learning 2021. Federated Learning (FL) is a recently proposed learning paradigm for decentralized devices to collaboratively train a predictive model without exchanging private data. Existing FL frameworks, however, assume a one-size-fits-all model architecture to be collectively trained by local devices, which is determined prior to observing their data. Even with good engineering acumen, this often falls apart when local tasks are different and require diverging choices of architecture modelling to learn effectively. This motivates us to develop a novel personalized neural architecture search (NAS) algorithm for FL, which learns a base architecture that can be structurally personalized for quick adaptation to each local task. On several real-world datasets, our algorithm, FEDPNAS, is able to achieve superior performance compared to other benchmarks on heterogeneous multitask scenarios. link: Paper https://neurips2021workshopfl.github.io/NFFL-2021/papers/2021/Hoang2021.pdf	08/24/2021	170	0.8
How much data is sufficient to learn high-performing algorithms?	How much data is sufficient to learn high-performing algorithms? ================================================================ Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2019) How much data is sufficient to learn high-performing algorithms?. STOC 2021 (preprint arXiv:1908.02894 [cs.LG] (2019)). Algorithms for scientific analysis typically have tunable parameters that significantly influence computational efficiency and solution quality. If a parameter setting leads to strong algorithmic performance on average over a set of typical problem instances, that parameter setting---ideally---will perform well in the future. However, if the set of typical problem instances is small, average performance will not generalize to future performance. This raises the question: how large should this set be? We answer this question for any algorithm satisfying an easy-to-describe, ubiquitous property: its performance is a piecewise-structured function of its parameters. We are the first to provide a unified sample complexity framework for algorithm parameter configuration; prior research followed case-by-case analyses. We present applications from diverse domains, including biology, political science, and economics. link: Preprint https://arxiv.org/abs/1908.02894 link: BibTeX bibtex/2019-balcan-sampcomplex.bib	08/09/2021	189	0.9
Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701)	Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701) =============================================================================================================================================================================================================================================== Toshiki Masuishi, Hiroya Taniguchi, Daisuke Kotani, Hideaki Bando, Taroh Satoh, Taito Esaki, Yoshito Komatsu, Yu Sunakawa, Tomohiro Nishina, Eiji Shinozaki, Naohiro Nishida, Masato Komoda, Satoshi Yuki, Naoki Izawa, Gaurav Sharma, Stan Skrzypczak, Eric Schultz, Carl Kingsford, Akihiro Sato, and Takayuki Yoshino (2021) Discovery of a potential predictive marker for eribulin treatment and novel target genes in BRAF V600E mutant metastatic colorectal cancer using an AI-driven RNA-seq analysis platform: Translational research of the BRAVERY study (EPOC1701). Journal of Clinical Oncology 39(15 suppl): e15532 (2021).	06/21/2021	121	0.6
Constructing small genome graphs via string compression	Constructing small genome graphs via string compression ======================================================= Yutong Qiu and Carl Kingsford (2021) Constructing small genome graphs via string compression. ISMB 2021. The size of a genome graph -- the space required to store the nodes, their labels and edges -- affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs. We point out similarities in the string encoding approaches of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed based on an optimal EPM compression. In addition to the transformation, we show that equivalent choices made by EPM compression algorithms may result in different sizes of genome graphs. To further optimize the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv EPM compression algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored de Bruijn graphs constructed by Bifrost under the default settings. 
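To make the compression side of this concrete, here is a minimal Python sketch of a greedy relative Lempel-Ziv parse, in which a target string is encoded as a list of pointers into a reference string. It is not the RLZ-Graph implementation and does not build a genome graph: the function names, the literal-character fallback, the naive quadratic search, and the choice of the leftmost source among equally long matches are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# A phrase is either a (reference_start, length) pointer or a single literal character.
Phrase = Union[Tuple[int, int], str]


def rlz_factorize(reference: str, target: str) -> List[Phrase]:
    """Greedily parse `target` into longest matches against `reference`."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(target):
        best_len, best_pos = 0, -1
        # Naive longest-match search over every reference position (quadratic time).
        for j in range(len(reference)):
            length = 0
            while (i + length < len(target)
                   and j + length < len(reference)
                   and reference[j + length] == target[i + length]):
                length += 1
            if length > best_len:  # strict '>' keeps the leftmost of equal-length sources
                best_len, best_pos = length, j
        if best_len == 0:
            phrases.append(target[i])  # character absent from the reference
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the target string from the reference and the phrase list."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    ref, tgt = "ACGTACGA", "ACGACGTN"
    parse = rlz_factorize(ref, tgt)
    print(parse)
    print(rlz_decode(ref, parse) == tgt)
```

When a phrase has several equally long sources in the reference, any of them decodes to the same string. The sketch simply takes the leftmost one, which is an arbitrary pick among the equivalent choices the abstract refers to.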
The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph link: DOI https://doi.org/10.1101/2021.02.08.430279 link: Preprint https://www.biorxiv.org/content/10.1101/2021.02.08.430279v1 link: Code https://github.com/Kingsford-Group/rlzgraph	02/10/2021	344	1.7
Practical selection of representative sets of RNA-seq samples using a hierarchical approach	Practical selection of representative sets of RNA-seq samples using a hierarchical approach =========================================================================================== Laura H. Tung and Carl Kingsford (2021) Practical selection of representative sets of RNA-seq samples using a hierarchical approach. ISMB 2021. Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the huge number of available RNA-seq samples and the large number of k-mers/sequences in each sample, computing the full similarity matrix between all samples using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges, making direct representative set selection infeasible with limited computing resources. Therefore, we developed a novel computational method called "hierarchical representative set selection" to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks the representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. 
We demonstrate that hierarchical representative set selection can achieve performance close to that of direct representative set selection, while largely reducing the runtime and memory requirements of computing the full similarity matrix (up to 8.4X runtime reduction and 4.7X memory reduction for 10000 samples that could be practically run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases such as the SRA. link: DOI https://doi.org/10.1101/2021.02.04.429817 link: Preprint https://biorxiv.org/cgi/content/short/2021.02.04.429817v1 link: Code https://github.com/Kingsford-Group/hierrepsetselection	02/06/2021	342	1.7
Yutong Qiu awarded SCS Cancer Research fellowship	Yutong Qiu awarded SCS Cancer Research fellowship ================================================= February 5, 2021 CPCB Ph.D. student Yutong Qiu has been awarded an SCS Cancer Research Fellowship. Yutong works on computational methods for understanding the human genome, including methods to identify variants within single genomes and populations of genomes. Her recent work is focused on construction and use of genome graphs for applications in cancer, especially more accurate subtyping of cancers from genomic features. [Image Omitted] Congratulations! More details here http://cbd.cmu.edu/news/2020/scs-cancer-research-fellowships-awarded-to-yutong-qiu-and-trevor-frisby.html	02/05/2021	95	0.5
Carl Kingsford awarded the Herbert A. Simon Professorship in Computer Science	Carl Kingsford awarded the Herbert A. Simon Professorship in Computer Science ============================================================================= February 4, 2021 Carl will receive the Herbert A. Simon Professorship of Computer Science in a virtual ceremony at 5:30 p.m. on Thursday, Feb. 4, 2021. For more information see here. https://www.cs.cmu.edu/news/2021/scs-celebrates-simon-alumni-research-professorships	02/04/2021	57	0.3