Red Hat Research Quarterly

Protecting data privacy: a look in our current toolkit

Gordon Haff

Gordon Haff is a Technology Advocate at Red Hat, where he works on emerging technology product strategy, writes about tech trends and their business impact, and is a frequent speaker at customer and industry events. His books include How Open Source Ate Software, and his podcast, in which he interviews industry experts, is Innovate @ Open.

Article featured in

Red Hat Research Quarterly

November 2023

The research uses for data could be endless, but without meeting stringent privacy requirements, some of the most promising analyses may never begin.

“Data is the new oil” is a shorthand generally credited to UK mathematician Clive Humby. The saying got considerable play when “Big Data” was the latest catchphrase around a decade ago. As some were quick to point out, Humby’s complete quotation notes that, like oil, data needs to be refined before we can actually use it.

Today, we see data refinement in the form of large language models (LLMs) and other innovations. At the same time, there’s a growing awareness of the liabilities that source data can bring to organizations, including legal action, reputational risk, and regulatory scrutiny.

Over the past few years, Red Hat Research and academic partners have been involved with projects exploring a variety of security and privacy questions related to data, including secure multiparty computation, differential privacy, confidential computing techniques (including both trusted execution environments and fully homomorphic encryption), and digital sovereignty. This article reviews these technologies and provides some updates.

Secure multiparty computation

Secure multiparty computation (MPC) is a family of cryptographic protocols that allow multiple parties to jointly compute a function over their inputs without revealing their private data to each other. It enables parties to collaboratively analyze data without compromising its confidentiality or privacy.

Many MPC protocols are based on the idea of secret sharing, which involves dividing a secret into multiple shares and distributing them among the parties. Each party has a share of the secret, but no single party has enough information to reconstruct the secret on its own. This ensures that the parties cannot learn each other’s private data, even if some of them collude, provided the number of colluding parties stays below the threshold the scheme tolerates. Essentially, MPC replaces a trusted third party with a cryptographic protocol.
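The secret-sharing idea can be sketched in a few lines of Python. This is a minimal illustration of additive secret sharing used to compute a joint sum, not the protocol used in any of the projects mentioned in this article. The modulus, the function names (`share`, `reconstruct`, `secure_sum`), and the payroll figures are all assumptions made up for the example, and everything runs in one process rather than across real parties.

```python
import secrets

# Public modulus for the share arithmetic (an illustrative choice).
MODULUS = 2**61 - 1


def share(secret: int, n_parties: int) -> list[int]:
    """Split `secret` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % MODULUS


def secure_sum(private_inputs: list[int]) -> int:
    """Simulate n parties computing the sum of their inputs via shares."""
    n = len(private_inputs)
    # Party i splits its input and hands share j to party j.
    distributed = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received; this reveals nothing on its own.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Only the combined partial sums reveal the total.
    return reconstruct(partial_sums)


# Hypothetical payroll totals held by three separate companies.
payrolls = [120_000, 95_000, 143_500]
print(secure_sum(payrolls))
```

Each party only ever sees random-looking shares and a partial sum, yet the final result equals the plain sum of the inputs. Real deployments add networking, authentication, and protection against parties that deviate from the protocol.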

A concrete past example of using this technique comes from research led by Boston University’s Azer Bestavros. It used payroll data from Boston-area companies that was securely collected and redistributed for wage-gap analysis without any company having access to the dataset as a whole. (See also “Conclave: secure multiparty computation on big data” and “Role-based ecosystem for the design, development, and deployment of secure multiparty data analytics applications.”) Work has also begun in the Red Hat Collaboratory at Boston University on developing a unikernel implementation of Secrecy, a relational MPC framework for privacy-preserving collaborative analytics as a service.

Differential privacy

Data from healthcare records can be a considerable boon for scientific research. However, patient data has enormous privacy implications. Even if an organization using the data for research is generally considered trustworthy, data leaks happen all the time.

The obvious solution is to anonymize the data. But this turns out to be surprisingly hard. It’s not always clear what can be used to identify someone and what can’t, especially once you start correlating with other data sources, including public ones. A widely published story from the 1990s tells how Latanya Sweeney, then an MIT graduate student, identified a supposedly anonymized healthcare record as belonging to then-Massachusetts governor William Weld, who had collapsed at a local event, merely by correlating the record with voter records.
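The general shape of such a linkage attack can be sketched as a join on shared attributes. The records, names, and the choice of quasi-identifiers below are entirely made up for illustration; they are not taken from the story above.

```python
# Attributes that appear in both data sets (an assumption for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Hypothetical "anonymized" medical records: names removed, other fields kept.
medical_records = [
    {"zip": "00001", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition X"},
    {"zip": "00002", "birth_date": "1982-06-30", "sex": "F", "diagnosis": "condition Y"},
]

# Hypothetical public voter roll that includes names.
voter_roll = [
    {"name": "Person A", "zip": "00001", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person B", "zip": "00002", "birth_date": "1982-06-30", "sex": "F"},
    {"name": "Person C", "zip": "00002", "birth_date": "1982-06-30", "sex": "F"},
]


def link(records, voters):
    """Return (name, diagnosis) pairs for records matching exactly one voter."""
    index = {}
    for voter in voters:
        key = tuple(voter[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(voter["name"])
    matches = []
    for record in records:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


print(link(medical_records, voter_roll))
```

In this toy data the first record is re-identified because only one voter shares its combination of attributes, while the second stays ambiguous because two voters match it.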

Organizations like the US Census Bureau have long had to deal with the challenges of publishing large numbers of tables that cut up data in many different ways. Over time, a great deal of research into the topic has produced various guidelines for working with data in this manner. One problem with traditional anonymization methods is that it’s often not well understood how successful they are at actually protecting privacy. Techniques that collectively fall under the umbrella of statistical disclosure control are often based on intuition and empirical observation.

However, in the 2006 paper “Calibrating noise to sensitivity in private data analysis,” Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith provided a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. This approach brought more rigor to the process of preserving privacy in statistical databases. It’s called differential privacy (specifically ε-differential privacy). A differentially private algorithm injects carefully calibrated random noise into a data set, or into the results computed from it, in a mathematically rigorous way to protect individual privacy.
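For reference, the standard statement of ε-differential privacy can be written as follows. The notation is chosen here for illustration: a randomized algorithm M satisfies the definition if, for every pair of databases D and D′ that differ in one individual’s record and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

A smaller ε forces the two output distributions closer together, so the presence or absence of any one person has less influence on what an observer sees.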

Because the data is “fuzzed,” to the extent that any given response could instead have plausibly been any other valid response, the results may not be quite as accurate as the raw data, depending upon the technique used. But other research has shown that it’s possible to produce very accurate statistics from a database while still ensuring high levels of privacy.
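One common way to add such noise is the Laplace mechanism, sketched below for a simple counting query. This is a minimal illustration, assuming a count has sensitivity 1 (adding or removing one person changes it by at most 1). The function names, the ages, and the ε value are invented for the example, and the floating-point sampling here is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # one person changes a count by at most 1
    # Noise scale = sensitivity / epsilon: smaller epsilon means more noise.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Made-up ages, purely for illustration.
ages = [34, 51, 47, 29, 62, 45, 38, 70]
rng = random.Random(2023)
noisy = private_count(ages, lambda age: age > 40, epsilon=0.5, rng=rng)
print(f"noisy count of people over 40: {noisy:.2f}")
```

Any single answer is perturbed, but the noise has zero mean, which is why aggregate statistics over large data sets can remain accurate.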

Differential privacy remains an area of active research. However, the technique is already in use: the Census Bureau used differential privacy to protect the results of the 2020 census.

Trusted execution environments

Other techniques focus more on cloud computing, where data is inherently out of an organization’s direct control. Cryptographic techniques that protect stored and in-transit data are well established. But how about data that is currently being used?

One technique is the trusted execution environment (TEE). This form of confidential computing uses secure enclaves within a processor to isolate and protect sensitive data and computations. TEEs are typically implemented using hardware-based security features that create an environment isolated from the rest of the system. This isolation is designed to ensure that sensitive data and computations cannot be accessed or tampered with, even if the rest of the system is compromised.

A great deal of work in this area is taking place in the Confidential Computing Consortium (CCC), a project community of the Linux Foundation. There are currently seven projects under the CCC umbrella.

Fully homomorphic encryption

A second confidential computing technique is fully homomorphic encryption (FHE). It lets a third party perform complicated data processing without being able to see the data set itself. Homomorphic encryption is essentially an extension of public-key cryptography, and the idea was in fact first proposed shortly after the RSA cryptosystem was invented.
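The connection to RSA can be seen in a toy example. Unpadded “textbook” RSA is homomorphic for multiplication only: multiplying two ciphertexts yields a ciphertext of the product. The sketch below uses tiny, insecure parameters chosen purely for illustration. It shows a partially homomorphic property, not FHE, which supports arbitrary computation on encrypted data.

```python
# Toy textbook-RSA parameters (far too small to be secure; illustration only).
P, Q = 61, 53
N = P * Q                  # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                     # public exponent
D = pow(E, -1, PHI)        # private exponent (modular inverse, Python 3.8+)


def encrypt(message: int) -> int:
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)


a, b = 7, 12
# A third party multiplies the ciphertexts without ever seeing a or b.
product_ciphertext = (encrypt(a) * encrypt(b)) % N

# The key holder decrypts and obtains a * b (modulo N).
print(decrypt(product_ciphertext))
```

The third party never learns `a` or `b`, yet the decrypted result is their product modulo `N`. Practical schemes replace this toy construction with ones that are secure and that support both addition and multiplication.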

A big challenge for FHE is that the technique is computationally very expensive and is not yet practical for most workloads. However, if realized, it would provide an additional level of protection against data leaks when using a public cloud or other service providers to analyze data sets.

[Figure: A timeline of encryption milestones leading to the development of FHE]

One approach to this challenge has been investigated in a project at the Red Hat Collaboratory: using FPGAs to accelerate operations (see Rashmi Agrawal and Lily Sturmann, “Preserving privacy in the open cloud: speeding up homomorphic encryption with custom hardware,” in RHRQ 4:3). CPUs aren’t good at exploiting the parallelism of FHE algorithms. GPUs are better but their floating-point units end up sitting idle. Neither can effectively meet the high memory bandwidth requirements of FHE workloads. FPGAs, on the other hand, map well to the requirements while being cheaper and more future-proof than custom ASICs.

The project’s long-term vision is to design a practical, efficient hardware accelerator supporting all four of the homomorphic encryption schemes under consideration by the International Organization for Standardization, and then deploy it in the open cloud to enable privacy-preserving computing systems in the Red Hat Collaboratory.

Digital sovereignty

The final topic isn’t a specific technology—though technology will play into some solutions—so much as it’s about geopolitical trends and concerns.

Shifting global regulations are pushing organizations to place workloads where they have greater control over security and data, and to avoid dependencies on organizations in countries that are hostile to varying degrees or may become so. Even allies may adopt regulations that are unfriendly to certain business models. The Capgemini Research Institute notes that the issues “are not new and have been gaining impetus over the past few years. However, it is a subject that is now under increasing scrutiny because of rising geopolitical tensions; changing data and privacy laws in different countries; the dominant role of cloud players concentrated in a few regions; and the lessons learned through the pandemic.”

The pursuit of digital sovereignty is a complex and evolving issue, and there is no one-size-fits-all approach. Countries will need to find their own path to digital sovereignty, taking into account their specific circumstances and priorities. There will certainly be domino effects. As the pursuit gains momentum, its effects on hyperscalers, local partnering requirements, and converged observability across clouds will all be open questions, as will the trajectory of regulatory regimes.

Further reading

You can explore all Red Hat Research projects related to privacy in the database on our website (research.redhat.com). RHRQ explored this issue in past articles including “How do we reconcile privacy with machine learning?” (RHRQ 1:2, 2019) and “Voyage into the open dataverse: an interview with James Honaker and Mercè Crosas” (RHRQ 2:2, 2020).
