Estimating mutual information under measurement error


Mutual information is widely used to characterize dependence between biological signals, such as co-expression between genes or co-evolution between amino acids. However, measurement error of the biological signals is rarely considered in estimating mutual information. Measurement error is widespread and non-negligible in some cases. As a result, the distribution of the signals is blurred, and the mutual information may be biased when estimated using the blurred measurements. We derive a corrected estimator for mutual information that accounts for the distribution of measurement error. Our corrected estimator is based on the correction of the probability mass function (PMF) or probability density function (PDF, based on kernel density estimation). We prove that the corrected estimator is asymptotically unbiased in the (semi-) discrete case when the distribution of measurement error is known. We show that it reduces the estimation bias in the continuous case under certain assumptions. On simulated data, our corrected estimator leads to a more accurate estimation for mutual information when the sample size is not the limiting factor for estimating PMF or PDF accurately. We compare the uncorrected and corrected estimator on the gene expression data of TCGA breast cancer samples and show a difference in both the value and the ranking of estimated mutual information between the two estimators.

bioRxiv 852384