Simply put, the problem of quantile estimation has to do with estimating the empirical or population cumulative distribution function (CDF) of a sample or a distribution. The amount of information or samples the estimator needs is determined by how many times we will need to query the CDF and how accurate we need our answers to be. In this short exposition, we will focus on empirical quantile estimation.
We define the -quantile to be the element
with rank of
amongst
elements. The
approximate
-quantile is any element with rank between
and
.
Example
Given sample {29, 62, 83, 53, 12, 21, 50, 87, 86, 2}, the 0.1-quantile, 0.2-quantile, and median (0.5-quantile) are 2, 12, 50 respectively. A straightforward way to obtain quantiles is to sort the dataset and the pick the element at the position. This method only works in the offline (non-streaming) setting. For
-approximate quantiles, the 0.1-approximate 0.2-quantile consists of the following elements {12, 21, 29}.
Greenwald-Khanna
From previous work [1], it is known that exact computation of quantiles cannot be done in space that is sub-linear in the stream length (if one is allowed at most one pass over the dataset). So one must resort to approximation in order to save space. This becomes especially important when performing analyses on really large datasets — think trillions of data records. In such cases, approximate computation of quantiles in bounded space might be the only feasible solution. Greenwald and Khanna (2001) present a deterministic algorithm for computing approximate quantiles in bounded space. We build on their algorithms to provide private quantiles [2].
With Differential Privacy
Recently, my collaborators and I came up with algorithms for computing DP quantiles in bounded space based upon the Greenwald-Khanna data structure for quantile estimation: (i) Through the exponential mechanism, we sample from intervals defined by the sketch; (ii) We build histograms using the sketch structure. Our algorithms use poly-logarithmic space and are experimentally evaluated on synthetic and real-world datasets [2]. We find that the histogram-based algorithms perform best on more concentrated datasets and for the multiple quantiles problem. In addition, we show theoretical utility bounds specialized to the normal distribution. Our main algorithm uses the exponential mechanism to privately sample from intervals defined by the Greenwald-Khanna sketch.
References
[1] J.I. Munro and M.S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science 12, 3 (1980), 315–323.
[2] Daniel Alabi, Omri Ben-Eliezer, and Anamay Chaturvedi. Bounded space differentially private quantiles. To appear in the Transactions on Machine Learning Research, 2023.
[3] Michael Greenwald and Sanjeev Khanna. Space-Efficient Online Computation of Quantile Summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21-24, 2001. 58–66.
https://doi.org/10.1145/375663.375670