Finding ranges of optimal transcript expression quantification in cases of non-identifiability


Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates of transcript abundances need not be unique. These methods are built on probabilistic models, and a probabilistic model with multiple optimal solutions is said to be non-identifiable. In expression quantification, this means that multiple configurations of transcript expression may be equally likely to generate the sequencing reads, so the underlying true expression cannot be uniquely determined. Existing methods do not report what the set of optimal solutions is or how far apart the equally optimal solutions are. Such information is necessary for evaluating the reliability of analyses that are based on a single inferred expression vector and for extending analyses to take all optimal solutions into account. We propose methods to compute the range of optimal estimates for the expression of each transcript when the probabilistic model for expression inference is non-identifiable. Because the accuracy and identifiability of expression estimates depend on the completeness of the input reference transcriptome, our method also takes into account an assumed percentage of expression from combinations of known junctions. Applying our method to 16 Human Body Map samples and comparing the results with the single expression vector quantified by Salmon, we observe that the ranges of optimal abundances are on the same scale as Salmon’s estimates. Analyzing the overlap of the ranges of optima in differential expression (DE) detection reveals that the majority of predictions are reliable, but that for a few unreliable predictions, switching to other optimal abundances may lead to similar expression between DE conditions. The source code can be found at