Voyage into the open Dataverse

About the author

Sherard Griffin

Sherard Griffin is a Director at Red Hat in the AI Center of Excellence. His primary responsibility is the development of Open Data Hub, a community-driven reference architecture for building an AI-as-a-service platform on Red Hat® OpenShift®. He is also responsible for Red Hat's deployment of Open Data Hub, which processes hundreds of gigabytes of data per day; the stored results are made available to analysts, developers, and data scientists across the company.

Article featured in Red Hat Research Quarterly, August 2020.

We spoke about the importance of data sharing and privacy preservation, in both scientific and computer technology domains, with James Honaker and Mercè Crosas, two of Harvard’s leaders in these fields. They discussed how we can make open source solutions for storing and sharing richly detailed information about experiments, software, and systems more available to all.

Sherard Griffin: James and Mercè, I’d love to hear more about what you do at Harvard and the projects you’re associated with. 

James Honaker: I work with a team in the computer science department doing research-adjacent computing. We take algorithms being developed by researchers and try to build prototypes from them. Or we take prototypes we’ve developed and try to turn them into more robust user tools—basically try to get code out of theory and into tools.

Red Hat’s Sherard Griffin (top right) and Heidi Dempsey (bottom right) sit down for a virtual chat with James Honaker (bottom left) and Mercè Crosas (top left).

One area we’ve focused on a lot has been building systems for privacy preservation: using the mathematical theories, definitions, and algorithms of differential privacy in a library- and researcher-oriented way, so that pragmatic researchers, data scientists, and statisticians can leverage differential privacy without needing any expertise in it themselves.

Mercè Crosas: I have two roles at Harvard. One is a university-wide role, with the Senior Leadership Team of Harvard IT. The university has a data management office, so I work with the CIO to help organize how we use data across Harvard. The other role is within the Institute for Quantitative Social Science (IQSS), as a senior science and technology officer. I’ve been with the Institute for about 15 years, and in that time we’ve built a public platform for building data repositories, called Dataverse.

Dataverse has an open source community around it and a couple of research projects associated with it. More recently, several projects within IQSS have worked on improving research by building research tools and providing data science consulting and training services. We started the Project for Differential Privacy in the context of how data repositories will be integrated with data privacy solutions.

Sherard: James, you mentioned that you focus a lot on building out systems and tooling for data privacy. What sparked your interest in data privacy?

It’s a real barrier to entry that every single person has to start from scratch. So why don’t we all build it together?
—James Honaker

James: It started when I was at IQSS with Mercè. The CS theorists, people who work on proving mathematically what’s possible with computation, had this large body of literature proving what things were possible or impossible in terms of privacy. IQSS was partnering with them to work out if any of these could be used as a tool for applied researchers. Somehow I got stuck in the bridging role for a little while, which was, “people don’t know how to talk to these people, so why don’t you go be the person who talks to them?”

Sherard: That’s certainly a valued asset.

James: Yes, I think it means I had grasped just enough content in enough fields to play translator. As it turned out, some of that bridging work involved communicating with scholars about what people did with data and how some of the things we already understood in statistics could translate. I ended up moving to their institute so I could do more work with them.

Sherard: A lot of what we do at Red Hat and the Office of the CTO is similar. We try to bridge the gap between what a product team is working on and what customers need. I’m curious how that gets incorporated with what companies like Red Hat and the open source communities do. What does that look like from your end, where you’re taking something theoretical and making a repeatable tool?

James: Getting code that works within the pragmatics of how computing actually occurs is something we push on. For example, these proofs are often written in the space of real numbers, but computers have to use floating points, and that difference tampers with the proofs. Or sometimes the proofs work at the level of orders of magnitude and don’t care about the exact constants. Those get wiped out of the proof, but if you can change the utility of an algorithm by a factor of two or ten, that matters a lot to a researcher: it’s the difference between half their data and a tenth of their data ending up usable.
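A quick way to see the gap Honaker describes between real-number proofs and floating-point arithmetic: the spacing between representable doubles grows with the magnitude of the value, so noise that is provably correct over the reals actually lands on a value-dependent grid. A minimal sketch (Python with NumPy assumed; illustrative only, not a hardened mechanism):

```python
import numpy as np

# Over the reals, Laplace noise can take any value. In floating point,
# the set of representable outputs depends on the magnitude of the true
# answer; np.spacing() shows the gap to the next representable double.
for value in [0.0, 1.0, 1e6, 1e12]:
    print(f"value={value:>14g}  gap to next float = {np.spacing(value):.3e}")

# Naive noise addition inherits that grid, which is one of the details
# the idealized real-number proofs gloss over.
true_answer = 1e12
noisy = true_answer + np.random.laplace(scale=10.0)
print("noisy answer:", noisy)
```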

We also work with use case partners. When you talk with an analyst who’s got a very specific use case, they have private data from one government entity, private data from another government entity, and maybe it has some weird distribution. And oh, this is the very hard way that we have to join them. All of those pragmatics end up pointing out things that weren’t quite covered by the original theory. 

Sherard: Does that feedback work its way back to the researchers? I imagine the real-world use cases are far different from what you would see in an isolated research environment.

James: The feedback loop is definitely a nice part of the iteration. It informs research agendas. Sometimes we pitch in on that. Sometimes we have ideas and then somebody will go off and prove that our idea actually makes sense. Sometimes, we can point out, “Look, your algorithm will be a lot more useful if you could do X, Y, and Z.” Then they go and work on that. 

Sherard: Mercè, tell us about how you got into data security and the privacy side of your work. 

In recent years, there’s been a move to more open science and data. But there are also requirements from funding organizations.
—Mercè Crosas

Mercè: So, part of the data work we started at IQSS was making the data available as openly as possible. In recent years, there’s been a move to more open science and data. But there are also requirements from funding organizations. If you publish in a journal, you have to make the dataset you used publicly available. The problem is, sometimes research uses sensitive data containing information about individuals. If we cannot make that data available, we cannot reuse datasets to reproduce results that have been published.

So, how could we find something in between? How could we build a platform for open data, using even the most sensitive datasets, in a way that organizations would still allow us to access some of the summary statistics or some of the results from their data? It was a practical necessity to get involved and provide solutions for accessing sensitive data for research.

Sherard: For those not familiar with Dataverse, can you give a couple sentences on what it is and how we can tie it back to differential privacy?

Mercè: Yes. Dataverse is a software platform that enables us to build data repositories for sharing research datasets. The emphasis is on publishing datasets associated with research that has already been published. Another use of the platform is to create datasets that could be useful for research and make them available more openly to our research communities.
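For readers who want a concrete sense of what working with a Dataverse repository looks like, here is a hedged sketch that queries the public Search API of the Harvard Dataverse installation. The endpoint and response fields are assumed from the public Dataverse documentation; check the guides for your own installation before relying on them.

```python
import requests

# Query Harvard Dataverse's public Search API for datasets. No API key is
# needed for public metadata. Endpoint assumed from the Dataverse guides.
resp = requests.get(
    "https://dataverse.harvard.edu/api/search",
    params={"q": "differential privacy", "type": "dataset", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

# The fields below are the ones commonly returned; read them defensively.
for item in resp.json().get("data", {}).get("items", []):
    print(item.get("name"), "-", item.get("global_id", "no persistent ID"))
```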

Sherard: One of the challenges we’ve faced at Red Hat is that the datasets we needed from a partner to create certain machine learning models had to have a fair amount of information. Unfortunately, the vendor had challenges sharing that data, because it had sensitive information in it. Have you run into scenarios where you’re trying to do analysis or machine learning on this kind of data? How would differential privacy or OpenDP help out?

The goal is to build a community of people who are willing to pitch in.
—James Honaker

James: That’s a great use case. So, differential privacy is a mathematical definition, not an algorithm. An algorithm either meets the definition or it doesn’t. If an algorithm is proven to meet that definition, you can reason about the use of that algorithm formally and make guarantees. Loosely speaking, the guarantee is that the releases, query answers, or models that your differentially private algorithm provides won’t leak information about any one individual. An adversary can’t even learn whether or not I was in the dataset in the first place, because the distribution of answers is barely affected by my information. It’s a very, very high, gold-standard guarantee.

Normally, these algorithms add a small amount of noise sufficient to drown out the contribution of any one individual in the dataset. So you don’t have to strip out all of these potentially sensitive attributes, because there’s no way to attach them to any individual. Stripping out sensitive data makes analysis really hard to run. Maybe the relationship between the sensitive variable and some other characteristic you care about is the fundamental quantity of interest. If you strip out the sensitive data, you can’t do anything.
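As a concrete illustration of the noise-adding mechanism Honaker describes, here is a minimal sketch of a differentially private count using the Laplace mechanism. It is a toy example under simple assumptions (a count query, which has sensitivity 1 because adding or removing one person changes the answer by at most 1), not the vetted OpenDP library.

```python
import numpy as np

def dp_count(records, epsilon):
    """Toy differentially private count via the Laplace mechanism.

    A count has sensitivity 1: any one individual changes the true answer
    by at most 1, so Laplace noise with scale 1/epsilon is enough to drown
    out any single person's contribution.
    """
    true_count = len(records)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many records satisfy a sensitive condition?
records = [{"has_condition": (i % 7 == 0)} for i in range(1000)]
positives = [r for r in records if r["has_condition"]]
print("noisy count:", dp_count(positives, epsilon=0.5))
```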

Sherard: It sounds like it would be quite a challenge to know how far to obscure the data or how much noise to add to make sure you don’t add too much. 

James: You cut exactly to the heart of the question. I call it very fine-tuned lying. If the noise is too great, you’re losing more utility than you needed to. If the noise is too low, the privacy guarantee goes away. The point is to balance that noise exactly; that’s why the ability to reason formally about these algorithms is so important. There’s a tuning parameter called epsilon. Even if an adversary has infinite computational power and knows algorithmic tricks that haven’t been discovered yet, epsilon tells you the worst-case leakage of information from a query. So what you decide is, “Okay, how far do I think I am from that worst case? How much information would I be willing to give such an attacker in order to release a query?” That tells you what that noise has to be.
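The balancing act Honaker describes shows up directly in the math: for the Laplace mechanism, the noise scale is the query’s sensitivity divided by epsilon, so the expected absolute error of a sensitivity-1 query is simply 1/epsilon. A quick sketch of that tradeoff (illustrative numbers only):

```python
# For a sensitivity-1 query answered with Laplace noise of scale 1/epsilon,
# the expected absolute error is 1/epsilon: a smaller epsilon gives a
# stronger worst-case guarantee but a noisier answer.
for epsilon in [0.01, 0.1, 0.5, 1.0, 5.0]:
    print(f"epsilon = {epsilon:<5}  expected |error| = {1.0 / epsilon:g}")
```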

Sherard: I want to come back to reproducibility. In software, we try not to release without having some level of continuous integration testing and ability to validate that the application will behave a certain way given certain parameters. What does that mean in the world of data science?

Mercè: Good question, and not an easy one. Computational reproducibility is a big topic. When I refer to reproducibility, I’m talking about using the same data and the same methods to see if you end up with the same results. So it’s validating the work, the scientific outcome of your research, which is an eternal battle. 

Sherard: I imagine you would need to be able to share the data that created the results in the first place.

Mercè: Exactly. That’s the connection with our work with Dataverse. A lot of the work connected to open science and open data is to make that possible. We also say the code and the software should be open, so you could reuse the same computation.

Sherard: How do you see the difference between differential privacy and OpenDP and some other privacy-protecting technologies, like multiparty computation? Why is encryption not enough?

James: They’re complementary. Most uses of encryption are about confidentiality. If I’ve got sensitive data, I don’t want somebody to hack the system and do an end run around my interface and pull the data out or monitor it in transit. 

But when I run an analysis on the data, I’m creating an answer I’m going to send out into the wild. I want to make sure that answer, after it leaves the system, can’t be plugged into some attack that leaks information. That’s what differential privacy is giving you. It’s giving you an interface between computations you might want to run on the data and what you can publish in the outside world.

Once you start answering queries on a dataset, you are necessarily leaking information. Differential privacy ensures you never answer the questions too precisely or answer too many questions. Differential privacy is not encrypting the data. It’s how you release things out of the system.
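The “never answer too many questions” part is usually enforced with a privacy budget: under basic composition, the epsilons of successive queries add up, and the system refuses to answer once a total budget is spent. Here is a minimal sketch of that bookkeeping (a hypothetical class for illustration, not OpenDP’s actual interface):

```python
import numpy as np

class BudgetedQuerier:
    """Toy privacy-budget accountant using basic composition:
    the epsilons of answered queries add up against a total budget."""

    def __init__(self, data, total_epsilon):
        self.data = data
        self.remaining = total_epsilon

    def noisy_count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted; refusing to answer.")
        self.remaining -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        return true_count + np.random.laplace(scale=1.0 / epsilon)

q = BudgetedQuerier(data=list(range(1000)), total_epsilon=1.0)
print(q.noisy_count(lambda x: x % 2 == 0, epsilon=0.4))  # answered
print(q.noisy_count(lambda x: x > 500, epsilon=0.4))     # answered
# A third query at epsilon 0.4 would exceed the budget and be refused.
```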

With multiparty computation, there are interesting connections. Multiparty computation is often how we share confidential data: how do we allow multiple people to hold confidential data yet reach a common answer? Multiparty computation is one answer. Differential privacy offers another: each of us creates a differentially private answer, then we work out how to add the two together. There are researchers here at Harvard and at Boston University looking at the connections between the two.
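The combination Honaker sketches can be illustrated very simply: each party perturbs its own answer with a differentially private mechanism before sharing it, and only the noisy answers are combined. This is a toy illustration of that idea, not a full multiparty computation protocol:

```python
import numpy as np

def locally_noised_total(local_total, epsilon, sensitivity=1.0):
    # Each party releases only a noised version of its own total, so the
    # other party never sees the raw value.
    return local_total + np.random.laplace(scale=sensitivity / epsilon)

# Two organizations want a joint total without sharing raw data.
party_a = locally_noised_total(local_total=12_400, epsilon=0.5)
party_b = locally_noised_total(local_total=9_850, epsilon=0.5)
print("joint noisy total:", party_a + party_b)
```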

Sherard: That’s a very good answer. So, I’m thinking of these scientists whose timelines are decades when it comes to how long they want their work to stay reproducible. Our timelines in hardware and software are a couple of years. What can you do to help with that difference?

Mercè: We have a team looking into this problem and looking at ways, for example, of using Docker containers to encapsulate the environment, including code and data, where you run computational analysis and are able to reuse it. Of course, everything has a timeframe. The containers might even change. You need to look at it the way libraries and archives have done for decades to see how you solve the preservation problem. The best solution is one that’s more preservable and least dependent on proprietary software. However, I don’t think it’s a problem that is solved for every case. Some things might be just too difficult to preserve long term.

One thing we’re looking into is how to summarize the metadata that goes with the data so it’s easier to rerun. Many times, the problem is the documentation. We’re trying to find ways of summarizing this in a simple format for anybody to reuse.

Sherard: How can engineers collaborate on OpenDP and some of the differential privacy work you’re doing?

James: What we’ve seen is that industry groups or tech companies build their own end-to-end, bespoke differential privacy system to solve one question they really care about, and they do it really well. And lots of academic researchers build an end-to-end prototype that demonstrates one thing they’ve been researching, and they do that really well. There’s a lot of overlap with cryptography, and the fundamental rule of cryptography is that you never want to roll your own, right? You want everybody to be using the same underlying library, because then everybody else has vetted it.

It’s a real barrier to entry that every single person has to start from scratch. So why don’t we all build it together? Why don’t we build one underlying library that everybody’s contributing to, that’s flexible enough that industry can use it for their problems and researchers can use it for their own cutting-edge directions?

Mercè did a lot of work building this OpenDP conference where people were discussing use cases like, “Here’s what my data is, here’s what analysts need to be able to do. This is what you need to be able to solve.” And people were talking about systems engineering and saying, “Okay, this is how I put my data in the cloud, and this is how I need to be able to access it, and this is how it scales. Make sure it works.” So there’s lots of places for people to contribute their talents. The goal is to build a community of people who are willing to pitch in.

Mercè: We had a session on collaboration in the OpenDP workshop where we talked about the code library, the center of all this work, because you need that to build anything on top of it. But then there is a whole layer of tooling that could make use of that library to provide user interfaces and run queries. Then there is another layer of end-to-end systems: say Red Hat wants to use data it doesn’t have, or the data within some of its tools, to provide a system that includes differential privacy. Then we find ways to partner in building that.

Sherard: What challenges do you foresee OpenDP facing in the near future? 

Mercè: One challenge for the differential privacy library is one many products face: getting the idea out there for people to use. Most of the components are already there; we just need to release them in a way where we feel comfortable they can be transferred and verified.

James: I hope this is a sign of our mutual respect and adoration of each other, that Mercè sees the hardest thing as the thing I do—building the library—and I would say the hardest thing is what Mercè does. There are all these groups saying, “This is a great idea. I want it to work in my context.” They’re all pulling in slightly different ways. How do you build a community that’s cooperative and balances all those interests? That seems like the phenomenally hard challenge. 

Sherard: One last question. When do you think differential privacy will be used commonly in datacenters? Is this something we can achieve?

James: That’s a good question. 

So, I got involved in this project just as my daughter was being born. And the whole literature was very new. It was all about potential and theory and abstract things. Now my daughter’s in kindergarten, and we’ve got actual systems that people are really using, and I think, “Okay, now the literature is sort of in kindergarten.” I’m hoping by the time she gets to high school the literature will also be in high school, which is to say it’ll know most of the subjects reasonably well and it’ll be pretty well rounded. That’s what Mercè and I are trying to push, but I hope it doesn’t take ten years. I hope it’s a very gifted child who gets to high school in five years.
