Adaptive, sample-specific parameter selection for more accurate transcript assembly


Motivation: Transcript assemblers reconstruct expressed transcripts from RNA-seq data. These tools have many tunable parameters, and accurate transcript assembly depends on setting them suitably. Because RNA-seq samples are heterogeneous, a single default setting or a small fixed set of parameter candidates can support good assembly performance only on average and is often suboptimal for individual samples. Manually tuning parameters for each sample is time-consuming and requires specialized experience. An automated system that can advise good parameter settings for individual samples is therefore an important need.

Results: Using Bayesian optimization and contrastive learning, we develop a new automated parameter advising system for transcript assembly that generates sets of good sample-specific parameter candidates. Our framework achieves efficient sample-specific parameter advising by learning parameter knowledge from a representative set of existing RNA-seq samples and transferring that knowledge to unseen samples. We test our framework with two well-known transcript assemblers, Scallop and StringTie, on two collections of RNA-seq samples. Results show that our new parameter advising system significantly outperforms the previous advising method on each dataset and with each transcript assembler.

Under review