Estimating Quantiles in Bounded Space

Simply put, quantile estimation is the problem of estimating the empirical or population cumulative distribution function (CDF) of a sample or a distribution. The amount of space or number of samples the estimator needs depends on how many times we need to query the CDF and how accurate the answers must be. In this short exposition, we focus on empirical quantile estimation.

We define the q-quantile to be the element x_q with rank \lfloor q \cdot n\rfloor among n elements. The \alpha-approximate q-quantile is any element with rank between \lfloor (q-\alpha)\cdot n\rfloor and \lfloor (q+\alpha)\cdot n\rfloor.

Example

Given the sample {29, 62, 83, 53, 12, 21, 50, 87, 86, 2}, the 0.1-quantile, 0.2-quantile, and median (0.5-quantile) are 2, 12, and 50, respectively. A straightforward way to obtain quantiles is to sort the dataset and then pick the element at position \lfloor q\cdot n\rfloor. This method only works in the offline (non-streaming) setting. For \alpha-approximate quantiles, the 0.1-approximate 0.2-quantile is any element whose rank lies between \lfloor 0.1\cdot 10\rfloor = 1 and \lfloor 0.3\cdot 10\rfloor = 3, i.e., any element of {2, 12, 21}.
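To make the offline approach concrete, here is a minimal Python sketch (an illustration, not production code) that sorts and then indexes; recall that rank r corresponds to sorted index r - 1:

    import math

    def quantile(data, q):
        # Offline q-quantile: sort, then pick the element of rank
        # floor(q * n). Ranks are 1-indexed, so rank r sits at index r - 1.
        xs = sorted(data)
        rank = math.floor(q * len(xs))
        return xs[max(rank - 1, 0)]

    sample = [29, 62, 83, 53, 12, 21, 50, 87, 86, 2]
    print(quantile(sample, 0.1), quantile(sample, 0.2), quantile(sample, 0.5))
    # prints: 2 12 50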

Greenwald-Khanna

From previous work [1], it is known that exact computation of quantiles in a single pass over the dataset requires space linear in the stream length. So one must resort to approximation in order to save space. This becomes especially important when performing analyses on very large datasets (think trillions of data records), where approximate computation of quantiles in bounded space might be the only feasible option. Greenwald and Khanna [3] present a deterministic algorithm for computing approximate quantiles in bounded space. We build on their algorithm to provide private quantiles [2].
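To give a flavor of the data structure, here is a simplified, unoptimized Python sketch in the spirit of Greenwald-Khanna. It is a toy illustration (the insertion, compression schedule, and bookkeeping only loosely follow the paper), not a faithful reimplementation of [3]:

    import math

    class GKSketch:
        # Toy Greenwald-Khanna-style sketch for eps-approximate quantiles.
        # Stores entries [value, g, delta], where g is the gap between the
        # minimum possible ranks of consecutive stored values and delta
        # bounds the uncertainty in this value's rank.

        def __init__(self, eps):
            self.eps = eps
            self.n = 0
            self.entries = []

        def insert(self, v):
            i = 0  # linear scan; a real implementation would binary-search
            while i < len(self.entries) and self.entries[i][0] < v:
                i += 1
            # A new minimum or maximum has exactly known rank (delta = 0).
            delta = 0 if i in (0, len(self.entries)) \
                else math.floor(2 * self.eps * self.n)
            self.entries.insert(i, [v, 1, delta])
            self.n += 1
            if self.n % max(int(1 / (2 * self.eps)), 1) == 0:
                self._compress()

        def _compress(self):
            # Merge neighbors whose combined rank uncertainty fits the budget.
            cap = math.floor(2 * self.eps * self.n)
            i = len(self.entries) - 2
            while i >= 1:
                _, g1, _ = self.entries[i]
                v2, g2, d2 = self.entries[i + 1]
                if g1 + g2 + d2 < cap:
                    self.entries[i + 1] = [v2, g1 + g2, d2]
                    self.entries.pop(i)
                i -= 1

        def query(self, q):
            # Return a value whose rank is within eps * n of ceil(q * n).
            target = math.ceil(q * self.n)
            slack = math.floor(self.eps * self.n)
            rmin = 0
            for i, (v, g, d) in enumerate(self.entries):
                rmin += g
                if rmin + d > target + slack:
                    return self.entries[max(i - 1, 0)][0]
            return self.entries[-1][0]

The space saving comes from _compress: the sketch retains only O((1/\epsilon) \log(\epsilon n)) entries, versus the n values a sort-based method would store.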

With Differential Privacy

Recently, my collaborators and I devised algorithms for computing DP quantiles in bounded space, based upon the Greenwald-Khanna data structure for quantile estimation [2]: (i) our main algorithm uses the exponential mechanism to privately sample from intervals defined by the sketch; (ii) a second family of algorithms builds histograms using the sketch structure. Our algorithms use poly-logarithmic space and are experimentally evaluated on synthetic and real-world datasets. We find that the histogram-based algorithms perform best on more concentrated datasets and for the multiple-quantiles problem. In addition, we show theoretical utility bounds specialized to the normal distribution.
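To illustrate the flavor of approach (i), here is a hedged Python sketch of the standard exponential mechanism for quantiles run over intervals induced by a sketch. The function name and the inputs values/ranks (the stored values and their approximate ranks) are hypothetical, and this is not the exact algorithm of [2]:

    import math
    import random

    def dp_quantile_from_sketch(values, ranks, n, q, eps_dp, rng=random):
        # values: sorted values stored in the sketch (interval endpoints)
        # ranks:  approximate rank of each stored value
        # n, q:   stream length and desired quantile
        # eps_dp: differential-privacy parameter
        target = q * n
        intervals = list(zip(values[:-1], values[1:]))

        # Utility of an interval: 0 if it covers the target rank, else the
        # negative rank-distance to the target (sensitivity 1 under
        # adding/removing one stream element).
        def utility(i):
            lo_r, hi_r = ranks[i], ranks[i + 1]
            if lo_r <= target <= hi_r:
                return 0.0
            return -min(abs(target - lo_r), abs(target - hi_r))

        # Exponential mechanism: weight each interval by its width times
        # exp(eps_dp * utility / 2), sample an interval accordingly, and
        # return a uniform point inside it.
        weights = [max(hi - lo, 0.0) * math.exp(eps_dp * utility(i) / 2)
                   for i, (lo, hi) in enumerate(intervals)]
        pick = rng.uniform(0.0, sum(weights))
        acc = 0.0
        for w, (lo, hi) in zip(weights, intervals):
            acc += w
            if pick <= acc:
                return rng.uniform(lo, hi)
        return values[-1]

Because the mechanism only ever touches the sketch's stored values and ranks, its space usage is that of the sketch itself, which is what makes bounded-space DP quantiles possible.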

References

[1] J.I. Munro and M.S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science 12, 3 (1980), 315–323.

[2] Daniel Alabi, Omri Ben-Eliezer, and Anamay Chaturvedi. Bounded space differentially private quantiles. To appear in the Transactions on Machine Learning Research, 2023.

[3] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21-24, 2001, 58–66. https://doi.org/10.1145/375663.375670

Graph Generation Algorithms

As you might have already heard, generative AI systems, especially those based on large language models like ChatGPT, are quite powerful tools that (are supposed to) augment human routines and reasoning. Specifically, ChatGPT builds on the GPT (Generative Pre-trained Transformer) model, which has been pre-trained on vast amounts of text data so that it can generate contextually appropriate responses to user input. If you have not tried ChatGPT yet, I highly recommend giving it a go!

Other types of generative AI technologies include GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders). From a simple web search, you can learn more about the applications of generative technologies, ranging from creating poetry to generating realistic-looking images and video. Another important domain for generative techniques is graph data, which encodes, for example, our friendships on Facebook or our connections on LinkedIn. In general, graphs are a powerful abstraction for representing and analyzing complex systems, from biological systems to social networks. However, generating realistic graphs that capture certain properties of naturally occurring networks can be challenging. A promising approach to this problem is the use of stochastic Kronecker graphs.

Stochastic Kronecker graphs are generated from a small seed matrix by repeated Kronecker multiplication. The Kronecker product is a mathematical operation that combines two matrices to produce a larger matrix; by applying it repeatedly, we can generate a large and complex graph. For example, a starting 2×2 seed matrix could look like this:

S = 1/16\cdot\begin{pmatrix}9 & 3 \\ 3 & 1 \end{pmatrix}.

To generate a graph from the seed S, we apply the Kronecker product to S repeatedly, so that S^{\otimes t} defines a distribution over graphs on 2^t vertices: entry (u, v) of S^{\otimes t} is the probability that node u is connected to node v. This, essentially, is the generative process of stochastic Kronecker graphs. Equivalently, each entry of the resulting adjacency matrix is set to 1 with the probability given by the corresponding entry of S^{\otimes t}, and to 0 otherwise, independently of all other entries.
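As a minimal sketch of this process in Python/NumPy (the choice t = 4 is just for illustration):

    import numpy as np

    # Seed matrix from the text: entries are edge probabilities.
    S = np.array([[9, 3],
                  [3, 1]]) / 16.0

    def kronecker_power(S, t):
        # Compute S^{\otimes t}, a (2^t x 2^t) matrix of edge probabilities.
        P = S
        for _ in range(t - 1):
            P = np.kron(P, S)
        return P

    def sample_graph(P, seed=0):
        # Include edge (u, v) independently with probability P[u, v].
        rng = np.random.default_rng(seed)
        return (rng.random(P.shape) < P).astype(int)

    P = kronecker_power(S, 4)  # distribution over graphs on 2^4 = 16 vertices
    A = sample_graph(P)
    print(A.sum(axis=1))       # degree sequence of one sampled graph

Note that materializing S^{\otimes t} takes space quadratic in the number of vertices; for large graphs, one typically samples edges directly without building the full probability matrix.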

It is well known that many real-world networks have degree distributions that follow a lognormal or power-law distribution. However, stochastic Kronecker graph models cannot always generate graphs whose degree distributions follow these laws; the generated degree distributions tend to exhibit significant oscillations. In a recent work (paper forthcoming), my friend/collaborator and I analyze techniques that could remove these oscillations from the degree distributions of generated graphs; see Figure 1 below. We believe our work could lead to a better understanding and classification of graph generation algorithms.

Simons Society of Fellows Retreat 2023

I attended the Simons Society of Fellows retreat held March 17 – 20, 2023 in Florida. I asked ChatGPT to write me a blog post about the retreat, but it gave the following response:

So the entirety of this blog post was written by me!

I learned a great deal about many topics in the sciences, ranging from sphere packing all the way to mating patterns in mosquitoes. See https://www.simonsfoundation.org/event/simons-society-of-fellows-retreat-2023/ for the abstracts of my colleagues' talks. The title of my talk was Polynomial-time Robust Moment Estimation via Semidefinite Programming Hierarchies. I spoke about some aspects of my recent work that uses the sum-of-squares framework to design polynomial-time, sample-optimal estimators with robustness and privacy guarantees.

In the course of preparing my talk, I also learned some interesting facts about John Tukey (a pioneer in robust statistics). Some notable quotes attributed to Tukey [1]:

  • “The test of a good procedure is how well it works, not how well it is understood.”
  • “You can’t expect scientists to solve political problems.”
  • “By the end of late 1945, I was a statistician rather than a topologist.”

[1] David R. Brillinger. John W. Tukey: his life and professional contributions. Annals of Statistics 30(6) (December 2002), 1535–1575. https://doi.org/10.1214/aos/1043351246