The ability to efficiently query genomic variation data from thousands of samples is critical to achieving the full potential of many medical and scientific applications, such as personalized medicine. We present VariantStore, a system for efficiently indexing and searching millions of genomic variants across thousands of samples.
We show the scalability of VariantStore by indexing genomic variants from the TCGA-BRCA project, containing 8640 samples and 5M variants, in ~4 hours and from the 1000 Genomes Project, containing 2500 samples and 924M variants, in ~3 hours. Querying for variants in a gene takes between 2 milliseconds and 3 seconds, using memory equal to only ~10% of the size of the full representation. As a baseline comparison, VariantStore outperforms the VG toolkit by 3X in memory usage and construction time and uses 25% less disk space, even though the VG toolkit does not support variant queries.