More accurate transcript assembly via parameter advising


Computational tools for genomic analysis are becoming more accurate, but also more sophisticated and complex. This introduces a new problem: these tools expose a large number of tunable parameters, many of which strongly influence the reported results. We quantify the impact of parameter choice on transcript assembly and take first steps toward a truly automated genomic analysis pipeline by developing a method that automatically chooses input-specific parameter values for reference-based transcript assembly. By choosing parameter values for each input, the area under the receiver operating characteristic curve (AUC) when comparing assembled transcripts to a reference transcriptome is increased by 28.9% over using only the default parameter choices on 1595 RNA-Seq samples in the Sequence Read Archive. This approach is general, and when applied to StringTie it increases AUC by 13.1% on a set of 65 RNA-Seq experiments from ENCODE. Parameter advisors for both Scallop and StringTie are available on GitHub.
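
The per-input selection described above can be illustrated with a minimal sketch. It assumes a finite set of candidate parameter settings is tried on each sample and the setting with the highest accuracy score is kept. The abstract does not say how candidates are produced or how the assembler is invoked, so `advise`, `assemble`, and `estimate_accuracy` are hypothetical names supplied by the caller, not part of Scallop, StringTie, or the released advisors.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Params = Dict[str, float]


def advise(
    sample: Any,
    candidates: Iterable[Params],
    assemble: Callable[[Any, Params], Any],
    estimate_accuracy: Callable[[Any], float],
) -> Tuple[Params, float]:
    """Return the candidate setting with the highest score on this sample.

    `assemble` and `estimate_accuracy` are hypothetical callables: the first
    runs an assembler on `sample` with one parameter setting, the second
    scores the result (for example, AUC against a reference transcriptome).
    """
    best_params = None
    best_score = float("-inf")
    for params in candidates:
        transcripts = assemble(sample, params)   # one assembler run per candidate
        score = estimate_accuracy(transcripts)   # higher is assumed to be better
        if score > best_score:
            best_params, best_score = params, score
    if best_params is None:
        raise ValueError("candidates must contain at least one parameter setting")
    return best_params, best_score
```

In this sketch the default setting can simply be included among the candidates, so the chosen setting never scores below the default on the given sample.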

The 2019 ICML Workshop on Computational Biology; journal version in Journal of Computational Biology