The Experiment Discovery Problem
Public databases such at the NIH Sequencing Read Archive (SRA) now contain hundreds of thousands of short-read sequencing experiments. A major challenge now is making that raw data accessible and useful for biological analysis — researchers must be able to find the relevant and related experiments on which to perform their analyses. A fundamental computational problem towards that effort is the problem of searching for short-read experiments by sequence.
Specifically, given a query string Q and a very large collection of short-read sequencing experiments S1, …, Sn, we want to quickly find the experiments that contain reads that make it likely that Q was among the sequences present.
Naturally, we would like to do this quickly, with as little hardware as possible. This was the motivation behind Sequence Bloom Trees (SBT), which allow for indexing large collections of short-read experiments quickly using a hierarchy of bloom filters. However, while the SBT is capable of rapidly searching a larger dataset then previously reported by any tool, it still is difficult to scale up to the 100-terabase or petabase scale. To scale to larger data sets, we developed the Split Sequence Bloom Tree, which can perform the same functions as an SBT 5-38x faster using 5x less on-disk storage. More details of this work and a more in-depth analysis of its efficiency can be found here (SSBT).
Split Sequence Filters
The principal change between an SBT and a Split SBT is that the information previously stored in a single filter is now split between two filters, a “similarity” filter (sim) and a “remainder” filter (rem). For an arbitrary node u, the similarity filter of u stores the elements of the Split SBT that are universally contained below it in the tree. The remainder filter stores everything in any filter below u that is not stored in the similarity filter.
As an example, the following Sequence Bloom Tree has a universally present bit in the left branch of the tree (highlighted in blue):The corresponding “split” representation stores that universally expressed element in B1, B2, and B3 as a single bit at the root of the subtree that contains these three nodes (seen below). In addition, as the similarity bit encodes all filters below it in the tree, we set that bit to zero for each filter below it in the SSBT.
This idea was simultaneously discovered by Chen et al. One advantage of this is that if a query is present in every experiment in a subtree, then we will be able to tell this at the root of that subtree (since the relevant Sim bits will all be 1), and we save the time it previously took to recurse into the leaves.
Another advantage of this representation is that the bitvectors tend to become sparser (since many bit are represented in fewer filters). This allows the RRR compression of the bit vectors to achieve better compression, and reduces the size of the index overall.
In fact, we can reduce the size even further by removing many of the bits in the filters. That is the idea of “Non-informative bits.”
Non-informative Bits in an SSBT
SSBT encodes the two filters at each node using less space than a single filter through the removal of all non-informative bits from the SSBT. In the context of SSBT, a “non-informative bit” is a bit index i in the node u whose bit value is uniquely determined by the set of all filters in the path from root to u. These bits fall into one of the following three categories:
- If urem[i]=0, then i is non-informative in both the sim and rem filters below u. If a bit in a remainder filter is zero, it means that this bit is not present in any of the filters below u in the SSBT. In the example below, the red 0 bit induces the blue non-informative bits below it.
- If usim[i]=1, then i is non-informative in the rem filter at node u. If a bit in a similarity filter is one (red bit in the example below), it means that this bit was universally conserved in each filter below u but is now being stored as this single bit. Since the remainder filter at u only stores the elements not in usim, urem[i] is always zero in this case and is non-informative (blue bit below).
- If usim[i]=1, then i is non-informative below u in both the rem and sim filters. If usim[i]=1, then urem[i] = 0, so we can apply both cases 1 and 2:
Taking these three cases together, we observe the total set of bits that can be removed as follows:
These non-informative bits are cumulative and this set is determined purely based upon the filter’s immediate parent, so every child of a node contains the same non-informative bits. This is helpful in the final stage of tree construction when we remove every bit highlighted in blue to produce the following SSBT:
This leads to a drastic reduction in overall size of the index on both small-scale indices (left) and large collections of sequencing experiments (right):
The left set of column compares the Bloom Filter Trie (BFT), the SBT (code available here), and the SSBT (SSBT code available here) on a set of 50 sequencing experiments. The right set of columns compares SBT and SSBT on a set of 2,652 sequencing experiments. In both cases, the SSBT achieves ~4-5x reduction in size.