DTMol: Pocket-based Molecular Docking using Diffusion Transformers

Haotian Teng, Ran Wang, Yihang Shen, Ye Yuan, Carl Kingsford (2025) DTMol: Pocket-based Molecular Docking using Diffusion Transformers. bioRxiv 648103.

Molecular docking --- predicting the binding structure of a small molecule ligand to a protein --- is a crucial task in computational chemistry and drug discovery. Traditional docking methods relying on scoring functions tend to be slow and inaccurate. Recent deep learning methods, especially diffusion-based generative models, have significantly improved the accuracy and computational efficiency of molecular docking. However, these methods still face challenges, particularly in the pocket-based docking setting, which involves docking a ligand when a protein pocket structure --- a cavity of the protein with potential ligand-binding capabilities --- is given. We introduce DTMol, a novel generative deep learning model designed to tackle the pocket-based molecular docking problem. Our model integrates a pretrained molecular representation framework with a new SE(3)-equivariant diffusion transformer architecture. The pretrained framework generates representations of both protein pockets and ligands, while the diffusion transformer effectively captures interaction information between them. Testing on the PDB-Bind dataset demonstrates that our method outperforms traditional docking methods and deep learning-based baselines. The efficacy of DTMol is further validated through a virtual screening task targeting Janus kinase 2 (PDB ID: 6BBV), followed by experimental validation of the top-ranked compounds via a protein kinase activity assay.

Biological databases in the age of generative artificial intelligence

Mihai Pop, Teresa K Attwood, Judith A Blake et al. (2025) Biological databases in the age of generative artificial intelligence. Bioinformatics Advances, vbaf044.

Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation and on understanding the impact of errors on analytic pipelines. Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.

DNA Language Models for RNA Analyses

Shiyi Du, Litian Liang, Jiayi Li, and Carl Kingsford (2025) DNA Language Models for RNA Analyses. OpenReview.

We introduce a novel Adaptive Mixture of Codon Reformative Experts (CodonMoE) that can be incorporated into DNA gLMs in order to adapt them for mRNA-based predictive tasks. We show that, by using this plug-and-play operator, DNA-based gLMs can achieve performance similar to that of RNA-trained models on mRNA tasks. We further show that recent, efficient sub-quadratic DNA-based state space model (SSM) architectures can be used with the CodonMoE to achieve parameter- and computationally-efficient predictions for mRNA tasks. Specifically, experimental results demonstrate that CodonMoE improves diverse DNA-based backbones by a large margin, with some models achieving comparable or superior performance to current state-of-the-art RNA-specific models across several downstream tasks, while reducing both time complexity and model parameters. Our results provide a path for focusing development efforts of gLMs on DNA models, which can then be adapted to mRNA tasks. Because DNA data is more prevalent than assembled mRNA data, and modeling efforts can focus on a single class of model, this is likely to foster improved DNA models for mRNA tasks at lower computational cost and is a significant step towards unifying genomic language modeling.

A synthesis for exactly 3-edge-connected graphs

Carl Kingsford and Guillaume Marçais (2025) A synthesis for exactly 3-edge-connected graphs. Discrete Applied Mathematics 368:18-29.

A multigraph is uniformly 3-edge-connected if there are exactly 3 edge-disjoint paths between any pair of vertices. For example, a uniformly 3-edge-connected graph is obtained from a 3-edge-connected graph by collapsing the nodes connected by more than 3 edge-disjoint paths into supernodes. We characterize the class of uniformly 3-edge-connected graphs, giving a synthesis involving two operations by which every uniformly 3-edge-connected multigraph can be generated. Slightly modified syntheses give the planar uniformly 3-edge-connected graphs and the uniformly 3-edge-connected graphs with the fewest possible edges, generalizing the well-known Harary graphs. In proving the correctness of the synthesis, we also show the existence of a particular type of induced, non-separating cycle in near 3-regular graphs, which is of interest in its own right.

Journal version of "How much data is sufficient to learn high-performing algorithms?"

Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik (2024) Journal version of "How much data is sufficient to learn high-performing algorithms?". J. ACM 71(5):32, pages 1--58.

The journal version of our 2021 STOC paper "How much data is sufficient to learn high-performing algorithms?" has been accepted to the Journal of the ACM.

Voices: How has the AI boom impacted algorithmic biology?

Mona Singh, Cenk Sahinalp, Jianyang Zeng, Wei Vivian Li, Carl Kingsford, Qiangfeng Zhang, Teresa Przytycka, Joshua Welch, Jian Ma, and Bonnie Berger (2024) Voices: How has the AI boom impacted algorithmic biology?. Cell Systems 15(6): P483-487.

The AI boom has affected algorithmic computational biology by further bringing traditional algorithmic thinking and ML and AI techniques closer together. One area where this is particularly true is in the field of automated algorithm design, where AI is used to inform or predict aspects of the design of an algorithm.

Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants

Yutong Qiu, Yihang Shen, and Carl Kingsford (2024) Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and its Variants. Alg. Mol. Biol.

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly.
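
To make the definition concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the definition, not the algorithm from the paper: it assumes directed graphs given as hypothetical (source, target, label) edge triples, enumerates every Eulerian trail by backtracking, and takes exponential time, so it is usable only on toy inputs. The function names are made up for this example.

```python
def edit_distance(s, t):
    """Standard Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def eulerian_strings(edges):
    """All strings spelled by Eulerian trails of a directed, edge-labeled
    multigraph given as (source, target, label) triples (assumed format)."""
    results = set()
    used = [False] * len(edges)

    def extend(node, spelled):
        if len(spelled) == len(edges):
            results.add("".join(spelled))
            return
        for i, (u, v, label) in enumerate(edges):
            # The first edge may start anywhere; later edges must continue the trail.
            if not used[i] and (node is None or u == node):
                used[i] = True
                spelled.append(label)
                extend(v, spelled)
                spelled.pop()
                used[i] = False

    extend(None, [])
    return results


def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian-trail strings."""
    return min(edit_distance(a, b)
               for a in eulerian_strings(edges1)
               for b in eulerian_strings(edges2))
```

For two 3-cycles whose labels differ in a single edge, this brute-force search returns a distance of 1.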

k-nonical space: sketching with reverse complements

Guillaume Marçais, C.S. Elder, and Carl Kingsford (2024) k-nonical space: sketching with reverse complements. Bioinformatics.

Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no equivalent in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms.
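
As a small illustration of what it means for a sequence to be equivalent to its reverse complement, the Python sketch below maps each k-mer to a canonical form, taken here to be the lexicographically smaller of the k-mer and its reverse complement. This is a common convention assumed for the example, not necessarily the construction used in the paper, and the function names are hypothetical.

```python
# Complement table for the DNA alphabet (uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Canonical form of every k-mer in seq, in left-to-right order."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

With this convention, a sequence and its reverse complement yield the same multiset of canonical k-mers, so a sketch built from them does not depend on which strand was read.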

Adaptive, sample-specific parameter selection for more accurate transcript assembly

Yihang Shen, Zhiwen Yan, and Carl Kingsford (2024) Adaptive, sample-specific parameter selection for more accurate transcript assembly. bioRxiv.

Transcript assemblers are tools to reconstruct expressed transcripts from RNA-seq data. These tools have a large number of tunable parameters, and accurate transcript assembly requires setting them suitably. Because RNA-seq samples are heterogeneous, a single default setting or a small fixed set of parameter candidates can support good transcript assembly performance only on average, and is often suboptimal for many individual samples.

Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation

Minh Hoang and Carl Kingsford (2024) Efficient Heterogeneous Meta-Learning via Channel Shuffling Modulation. ICLR 2024.

We tackle the problem of meta-learning across heterogeneous tasks. This problem seeks to extract and generalize transferable meta-knowledge through streaming task sets from a multi-modal task distribution. The extracted meta-knowledge can be used to create predictors for new tasks using a small number of labeled samples. Most meta-learning methods assume a homogeneous task distribution, thus limiting their generalization capacity when handling multi-modal task distributions. Recent work has shown that the generalization of meta-learning depends on the similarity of tasks in the training distribution, and this has led to many clustering approaches that aim to detect homogeneous clusters of tasks. However, these methods suffer from a significant increase in parameter complexity.

A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements

C.S. Elder, Minh Hoang, Mohsen Ferdosi, and Carl Kingsford (2024) A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements. RECOMB 2024.

The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to end points). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-complete under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors.
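
The noise-free Turnpike setting can be sketched in a few lines of Python. The code below is an illustration only and is not the optimization algorithm of the paper: it assumes exact, distinct integer positions, computes the unordered multiset of pairwise distances, and then recovers one consistent point set with the classical backtracking strategy of repeatedly placing the largest remaining distance. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_distances(points):
    """Unordered multiset of pairwise distances of a linear point set."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


def solve_turnpike(distances):
    """Backtracking reconstruction for exact distances (no noise handling).
    Returns one sorted point list anchored at 0, or None if none exists."""
    remaining = Counter(distances)
    if not remaining:
        return [0]
    width = max(remaining)          # the largest distance spans the whole set
    remaining[width] -= 1
    points = [0, width]

    def distances_needed(p):
        """Distances from p to all placed points, or None if unavailable."""
        needed = Counter(abs(p - q) for q in points)
        if any(remaining[d] < c for d, c in needed.items()):
            return None
        return needed

    def backtrack():
        if sum(remaining.values()) == 0:
            return True
        largest = max(d for d, c in remaining.items() if c > 0)
        # The largest remaining distance must be measured from one of the two ends.
        for p in (largest, width - largest):
            needed = distances_needed(p)
            if needed is None:
                continue
            remaining.subtract(needed)
            points.append(p)
            if backtrack():
                return True
            points.pop()
            remaining.update(needed)
        return False

    return sorted(points) if backtrack() else None
```

The solution is determined only up to translation and reflection, so the recovered points may be a mirror image of the original set. With noisy distances the exact-match test in `distances_needed` no longer applies, which is the regime the paper addresses.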

Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework

Haotian Teng, Marcus Stoiber, Ziv Bar-Joseph, and Carl Kingsford (2024) Detecting m6A RNA modification from nanopore sequencing using a semi-supervised learning framework. Genome Research 34:1987-1999.

Direct nanopore-based RNA sequencing can be used to detect post-transcriptional base modifications, such as m6A methylation, based on the electric current signals produced by the distinct chemical structures of modified bases. A key challenge is the scarcity of adequate training data with known methylation modifications. We present Xron, a hybrid encoder-decoder framework that delivers a direct methylation-distinguishing basecaller by training on synthetic RNA data and immunoprecipitation-based experimental data in two steps.

Improving Hi-C contact matrices using genome graphs

Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, and Carl Kingsford (2024) Improving Hi-C contact matrices using genome graphs. RECOMB 2024.

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of the Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes.

Sketching methods with small window guarantee using minimum decycling sets

Guillaume Marçais, Dan DeBlasio, and Carl Kingsford (2023) Sketching methods with small window guarantee using minimum decycling sets. Journal of Computational Biology.

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment.
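
As a rough illustration of k-mer selection with a window guarantee, the Python sketch below uses a plain lexicographic minimizer scheme: in every window of w consecutive k-mers, the smallest k-mer is selected. This well-known scheme is swapped in for illustration and is not the decycling-set construction of the paper; the function names and the choice of lexicographic order are assumptions.

```python
def minimizers(seq, k, w):
    """Select the lexicographically smallest k-mer (leftmost on ties) in
    every window of w consecutive k-mers. Returns sorted (position, k-mer)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index that attains the minimum.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Because every window contributes a selected k-mer, consecutive selected positions are at most w apart. Similarity between two sequences can then be estimated by comparing only the selected k-mers, for example with `jaccard`.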

Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC

Hossein Asghari, Ehsan Haghshenas, Roby Thomas, Eric Schultz, Rob Patro, Stan Skrzypczak, and Carl Kingsford (2023) Novel expression biomarkers via prediction of response to FOLFIRINOX (FFX) treatment for PDAC. Cancer Res (2023) 83 (7_Supplement): 1400.