2024

Haotian Teng successfully defends his dissertation

August 26, 2024 — Haotian Teng has successfully defended his Ph.D. dissertation. Teng was co-advised by Carl Kingsford and Ziv Bar-Joseph. Congratulations Dr. Teng!

Haotian Teng (2024) Improving performance on unsupervised biological tasks with hybrid models. Ph.D. Thesis Tech Report.

Journal version of "How much data is sufficient to learn high-performing algorithms?"

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1-58.

The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms?" has been accepted to the Journal of the ACM.

Voices: How has the AI boom impacted algorithmic biology?

Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology? Cell Systems 15(6): P483-487.

The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm.

Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants

Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol.

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly.

11% of RECOMB24 papers co-authored by current or former Kingsford group members

April 24, 2024 — Six out of 57 accepted RECOMB24 papers are authored by current or former Kingsford group members. These include 2 papers authored by current Ph.D. students, and 4 authored by former trainees of the group.

Yihang Shen successfully defends his dissertation

April 10, 2024 — Yihang Shen has successfully defended his Ph.D. Congratulations Dr. Shen!

Yihang Shen (2024) Automated hyper-parameter tuning and its applications in computational biology. Ph.D. Thesis (CMU-CB-24-100).

Carl Kingsford named ISCB Fellow

March 8, 2024 — Carl Kingsford, Herbert A. Simon Professor of Computer Science in the Ray and Stephanie Lane Computational Biology Department, has been elected as a Fellow of the International Society for Computational Biology (ISCB).

The ISCB notes that Carl has been chosen for this honor because he “is a trailblazer in computational molecular biology, showcasing sustained innovation in scalable algorithmic approaches.”

k-nonical space: sketching with reverse complements

Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. Bioinformatics.

Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms.
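
As a rough illustration of the reverse-complement equivalence mentioned above (not code from the paper), the Python sketch below uses a common convention: a k-mer is represented by the lexicographically smaller of itself and its reverse complement. The function names are hypothetical, and the alphabet is assumed to be A, C, G, T only.

```python
# Illustrative only: a common "canonical k-mer" convention, not the paper's method.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Set of canonical k-mers occurring in seq."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
```

Under this convention, a sequence and its reverse complement yield the same set of canonical k-mers, which is the property that has no counterpart in ordinary text.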

Adaptive, sample-specific parameter selection for more accurate transcript assembly

Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv.

Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because RNA-seq samples are heterogeneous, a single default setting or a small fixed set of parameter candidates can support good transcript assembly performance only on average, and is often suboptimal for individual samples.

Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation

Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024.

We tackle the problem of meta-learning across heterogeneous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity.

A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements

C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024.

The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors.
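
To make the problem input concrete, the following Python sketch computes the unordered distance multisets that the reconstruction would start from. It covers only the forward direction (points to distances), not the paper's algorithm. The function names are hypothetical, and the circular version assumes one common formulation in which both arc lengths between each pair are reported.

```python
from itertools import combinations


def turnpike_distances(points):
    """Sorted multiset of pairwise distances between points on a line."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))


def beltway_distances(points, circumference):
    """Sorted multiset of arc lengths between points on a circle.

    Assumption: both arcs (clockwise and counterclockwise) are reported
    for every pair of points.
    """
    dists = []
    for a, b in combinations(points, 2):
        d = abs(a - b) % circumference
        dists.extend([d, circumference - d])
    return sorted(dists)
```

For example, the point sets [0, 2, 5] and [0, 3, 5] are mirror images and produce the same Turnpike distances [2, 3, 5], so a solution can be unique only up to such symmetries.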

Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework

Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. Genome Research 34:1987-1999.

Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps.

Improving Hi-C contact matrices using genome graphs

Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024.

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of the Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes.

2023

Sketching methods with small window guarantee using minimum decycling sets

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology.

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment.
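
The general idea of selecting k-mers and comparing only the sketches can be illustrated with a minimal Python example. It uses a plain lexicographic minimizer scheme as a stand-in, not the decycling-set construction from the paper. The function names and the parameters k and w are assumptions chosen for illustration.

```python
def minimizers(seq, k, w):
    """Select the lexicographically smallest k-mer in each window of w
    consecutive k-mers (a basic minimizer scheme)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Compare two sequences using only their sketches.
sketch_a = minimizers("ACGTAC", k=2, w=3)
sketch_b = minimizers("ACGTTT", k=2, w=3)
similarity = jaccard(sketch_a, sketch_b)
```

Because every window of w consecutive k-mers contributes a selected k-mer, this scheme has a window guarantee: no stretch of the sequence longer than the window is left unsampled.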

Minh Hoang successfully defends his dissertation

October 16, 2023 — Minh Hoang has successfully defended his Ph.D. dissertation, which is titled “Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology.” He will join Princeton University as a postdoc. Congratulations Dr. Hoang!

Minh Hoang (2023) Practical Methods for Automated Algorithm Design in Machine Learning and Computational Biology. Ph.D. Thesis.

Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC

Hossein Asghari, Ehsan Haghshenas, Roby Thomas, Eric Schultz, Rob Patro, Stan Skrzypczak, and Carl Kingsford (2023) Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC. Cancer Res (2023) 83 (7_Supplement): 1400.
