Querying a short-read database for a transcript of interest is a fundamental problem in biology. Yet such queries are computationally intensive, and their cost scales linearly with the size of the data being searched. This creates a computational bottleneck: large databases of sequencing reads are compiled but never investigated systematically. To address this problem, we developed the Sequence Bloom Tree (SBT), a data structure for searching short-read expression experiments for transcripts of interest. Rather than naively exploring every file in a database, the SBT prunes files that, with high probability, do not contain the query. Its search cost therefore scales linearly with the number of experiments containing the query rather than with the total size of the experiment set.
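The pruning idea can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the SBT implementation: it uses a small hand-rolled Bloom filter rather than Jellyfish, and the names `Node`, `query`, `K`, `M`, `NUM_HASHES`, and the match threshold `theta` are hypothetical choices made for illustration. Each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the union of its children's filters, and a search descends only into subtrees whose filter contains enough of the query's k-mers.

```python
import hashlib

K = 8            # k-mer length (illustrative value, not a tool default)
M = 1 << 16      # bits per Bloom filter (illustrative value)
NUM_HASHES = 2   # hash functions per k-mer (illustrative value)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer):
    """Map a k-mer to NUM_HASHES bit positions using slices of a SHA-256 digest."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M
            for i in range(NUM_HASHES)]


class Node:
    """A tree node whose Bloom filter is stored as the bits of a Python int."""

    def __init__(self, name, reads=(), children=()):
        self.name = name
        self.children = list(children)
        self.bits = 0
        # Leaf: insert every k-mer of every read in this experiment.
        for read in reads:
            for km in kmers(read):
                for p in positions(km):
                    self.bits |= 1 << p
        # Internal node: the filter is the bitwise union of the children's filters.
        for child in self.children:
            self.bits |= child.bits

    def contains(self, kmer):
        return all((self.bits >> p) & 1 for p in positions(kmer))


def query(node, query_kmers, theta=0.9, visited=None):
    """Return names of leaves where at least theta of the query k-mers are present."""
    if visited is not None:
        visited.append(node.name)
    hits = sum(node.contains(km) for km in query_kmers)
    if hits < theta * len(query_kmers):
        return []                      # prune: skip this entire subtree
    if not node.children:
        return [node.name]             # leaf: report the experiment
    results = []
    for child in node.children:
        results.extend(query(child, query_kmers, theta, visited))
    return results


# Four made-up "experiments", one read each, arranged in a two-level tree.
a = Node("A", reads=["ACGTACGTTAGCTAGCTAGGATCCA"])
b = Node("B", reads=["TTGACCGGTTAACCGGTTAACGTCA"])
c = Node("C", reads=["GGGGCCCCAAAATTTTGGGGCCCCA"])
d = Node("D", reads=["CATCATCATGGTGGTGGTCCACCAC"])
root = Node("root", children=[Node("left", children=[a, b]),
                              Node("right", children=[c, d])])

visited = []
matches = query(root, kmers("ACGTTAGCTAGCTAGG"), visited=visited)
print(matches, visited)
```

In this toy example the query is a substring of experiment A's read, so the search should report only A, and the `right` subtree should be rejected at its root without its leaves ever being examined. Because Bloom filters admit false positives, a real deployment would size the filters and choose the threshold according to the accuracy it needs.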
The SBT is built on the Bloom filter implementation in the Jellyfish library, and its default settings are designed as a reasonable compromise among speed, storage cost, and accuracy. In the paper, we demonstrate that the SBT can search multi-terabyte databases substantially faster than any existing tool, with reasonable accuracy and negligible storage costs both on disk and in RAM.