A statistical method for identifying consistently important features across samples


In many applications, features with consistently high measurements across many samples are particularly meaningful and useful for quality control or biological interpretation. Identification of these features among many others can be challenging especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given real-valued data, CFD requires no parameters, makes no assumptions on the underlying sample distributions, and is robust to differences across these distributions. We show that with high probability CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify lists of housekeeping genes as highly expressed genes across tissue types and compare to previous results in this domain. CFD is consistent between the two data sets, and identifies lists of genes enriched for basic cellular processes as expected. The framework can be easily adapted for many data types and desired feature properties.