Data Markets for Federated Learning

The Database Community (e.g., see this symposium on data markets) has recently been championing frameworks for data access, search, commodification, manipulation, extraction, refinement, and storage. I first heard about data markets from Eugene Wu; it seems like a research area, and a market opportunity, that is ripe for exploration.

In recent work that Jerry presented in person at VLDB 2023, we describe a data search platform (called Saibot) that satisfies differential privacy. Essentially, the main algorithm can identify augmentations (tables that are join- or union-compatible, combined via the operations + and ×) that lead to highly accurate models (the evaluation objective is the \ell_2 metric, but the approach extends to other objectives as well). This has implications for improving data quality (e.g., identifying the right augmentations that lead to better outcomes) and for heterogeneous collaboration of all kinds. We evaluate our algorithms on over 300 datasets and compare against leading alternative mechanisms.
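To make the augmentation-search idea concrete, here is a minimal, non-private sketch of the scoring loop: join each candidate augmentation onto the base features, fit a least-squares model, and keep the augmentation that most reduces the \ell_2 error. This is not Saibot's actual algorithm (which additionally satisfies differential privacy); the data and column names (`hidden_col`, `noise_col`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base table: one observed feature; the target also depends
# on a column the base table is missing.
n = 200
hidden = rng.normal(size=n)
base_X = rng.normal(size=(n, 1))
y = 2.0 * base_X[:, 0] + 3.0 * hidden + rng.normal(scale=0.1, size=n)

# Candidate augmentations (join-compatible columns); only one is informative.
candidates = {
    "noise_col": rng.normal(size=(n, 1)),
    "hidden_col": hidden.reshape(-1, 1),
}

def l2_error(X, y):
    """Fit least squares and return the mean squared residual (the l2 objective)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(np.mean(resid ** 2))

base_err = l2_error(base_X, y)
scores = {name: l2_error(np.hstack([base_X, aug]), y)
          for name, aug in candidates.items()}
best = min(scores, key=scores.get)  # augmentation with the lowest l2 error
```

Here `best` ends up being the informative column, since augmenting with it explains the residual variance that the base feature cannot.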

Opportunities and Challenges in Data Markets

  1. Data Quality and Accuracy: In my opinion, the biggest challenge to the proliferation of data markets is the availability of high-quality data. No amount of analytical sophistication can compensate for a lack of high-quality data. For example, certain subgroups in America (e.g., African-American women) are under-represented in datasets about academia. In fact, most academic departments in the U.S. do not have any African-American women at all. So if a social science researcher wishes to study the progression of women in academia and observe trends, the researcher cannot make broad claims about departments that do not have a single Black woman. The researcher must first seek out data sources of higher quality, e.g., by including data from HBCUs (Historically Black Colleges and Universities).
  2. Privacy and Security Concerns: Suppose a hospital has data on patient check-ins, health characteristics, and disorders. If released, the data could help researchers gain valuable insights about diseases in specific areas. Unfortunately, it is known that releasing exact aggregate information about individuals (even from datasets that are “anonymized”) can lead to de-anonymization/re-identification attacks. Our work on Saibot provides mechanisms to ensure that data search platforms satisfy certain notions of differential privacy.
  3. Collaboration and Knowledge Sharing: Data markets encourage collaboration between organizations and industries. They facilitate the sharing of knowledge and expertise, breaking down silos (especially within academia) and fostering a culture of collective problem-solving. However, one could ask: how much collaboration—between industries—is needed to solve a problem or achieve a certain level of accuracy for statistical models? This problem needs further study.
  4. Economic Value: Some technology companies (e.g., Netflix and Facebook) derive their value proposition (almost) entirely from having lots of users and interactions on their platforms. Having more specific forms of data (e.g., data on African-American women) could give companies a competitive advantage in data markets. So access to data markets can create new revenue streams. I would personally like to see more economic analysis of the value of data markets!
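On the privacy point above: a standard building block for releasing aggregates without enabling re-identification is the Laplace mechanism, which adds calibrated noise to a query answer. The sketch below is a textbook example, not Saibot's mechanism, and the hospital count is a made-up number.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a count under epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one patient changes
    the count by at most 1), so the noise scale is 1/epsilon.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
checkins = 1284  # hypothetical hospital check-in count
noisy_release = laplace_count(checkins, epsilon=0.5, rng=rng)
```

Smaller epsilon means stronger privacy but noisier releases; individual releases are perturbed, while averages over many queries remain close to the truth.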

Recap: INFORMS 2023 and the Applied Probability Society

I attended INFORMS 2023 (for the first time!), hosted in Phoenix, Arizona 🥵. It was a nice experience overall! I mostly attended the Applied Probability Society sessions during the conference.

About INFORMS

The Institute for Operations Research and the Management Sciences, or INFORMS, is the world’s largest professional society dedicated to operations research and analytics. With a mission to promote the scientific approach to decision-making, INFORMS plays a critical role in connecting researchers, practitioners, and educators, fostering a vibrant community that advances related fields: operations research, statistics, computer science, mathematics, and so on. I also learned a fair bit about what the fields of revenue and supply-chain management are about. The 4-day program had 84 tracks, 11 major tutorials, and hundreds of sessions (one of which I chaired).

Applied Probability Society (APS)

The society is “concerned with the application of probability theory to systems that involve random phenomena” and “members include practitioners, educators, and researchers with backgrounds in business, engineering, statistics, mathematics, economics, computer science, and other applied sciences.” I attended the APS business meeting, where the inaugural Blackwell Award was presented (David Blackwell was an INFORMS fellow) and other APS-specific issues were discussed.

APS Session on “Optimization over Probability Distributions”

I chaired a session with the following talks:

1) Abdul Canatar (Flatiron Institute) on “Out-of-Distribution Generalization in Kernel Regression” https://arxiv.org/abs/2106.02261

2) Prayaag Venkat (Harvard) on “Near-optimal fitting of ellipsoids to random points” https://arxiv.org/abs/2208.09493

3) Ellen Vitercik (Stanford) on “Leveraging Reviews: Learning to Price with Buyer and Seller Uncertainty” https://arxiv.org/abs/2302.09700

4) R. Srikant (UIUC) on “Crowdsourcing with Hard and Easy Tasks”

5) Daniel Alabi (Columbia) on “Degree Distribution Identifiability of Stochastic Kronecker Graphs” https://arxiv.org/abs/2310.00171

Until the conference, I hadn’t heard the speakers talk about these specific works. So the APS session was a direct way to learn about what they have been up to recently. Overall, I learned a lot from the conference and I’m looking forward to attending future iterations.

Some Resources for Learning about Spectral Techniques

Recently, I have been asked at least three times about resources for learning spectral techniques. Three is the threshold that warrants a short blog post 🙂

First, what are spectral techniques? They are a set of mathematical methods and tools that involve analyzing the eigenvalues and eigenvectors of matrices associated with a given mathematical object or data structure. Examples of such structures include the adjacency matrix of a graph or the unitary matrix representing common quantum logic gates.

Image of Some Quantum Logic Gates

Spectral techniques are generally popular within fields (e.g., math, physics) that deal with data decompositions or linear transformations. Here are some resources (in mostly random order) for learning about spectral techniques:

  1. NetworkX: a Python library and playground for network analysis.
  2. The book “Spectral Graph Theory” by Fan R. K. Chung, which covers the mathematical basics of spectral graph theory.
  3. The reference “Spectral and Algebraic Graph Theory” by Daniel A. Spielman, which contains background material and applications of spectral graph theory. It also has some accompanying code to further reinforce understanding! This is a huge plus.
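As a tiny taste of what these resources cover, here is a minimal sketch using only numpy: build the Laplacian of a small graph from its adjacency matrix and read off a classic spectral fact, that the multiplicity of the eigenvalue 0 equals the number of connected components. The example graph is made up for illustration.

```python
import numpy as np

# Adjacency matrix of a graph with two connected components:
# a triangle {0, 1, 2} and a single edge {3, 4}.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # combinatorial graph Laplacian
eigvals = np.linalg.eigvalsh(L)  # L is symmetric, so eigenvalues are real

# Multiplicity of eigenvalue 0 = number of connected components.
num_components = int(np.sum(np.isclose(eigvals, 0.0)))
```

The same computation is available in NetworkX (e.g., via its Laplacian spectrum utilities) for larger graphs.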