Red Hat Research Quarterly

Red Hat Research Days 2020—What are we thinking about now?

Gordon Haff

Gordon Haff is a Technology Advocate at Red Hat, where he works on emerging technology product strategy, writes about tech trends and their business impact, and is a frequent speaker at customer and industry events. His books include How Open Source Ate Software, and his podcast, in which he interviews industry experts, is Innovate @ Open.

Article featured in

Red Hat Research Quarterly

February 2021

Highlights from the distributed workflows and infrastructure software tracks

Last September, a quartet of virtual Red Hat Research Days dove into kernel/hardware development, distributed workflows, privacy, and infrastructure software. The virtual conferences also covered some of the research on display in Red Hat Research Quarterly, including James Honaker and Merce Crosas’ work on balancing data sharing and privacy in a recent issue (see issue 2:2, “Voyage into the open dataverse”).

Here, we highlight some of the specific research covered on the distributed workflows and infrastructure software days.

Training AI on small and biased datasets

In an ideal world, machine learning and artificial intelligence (ML/AI) models would train on large volumes of perfectly representative, well-labeled data. In practice, things are rarely that neat and tidy. Kate Saenko, an associate professor at Boston University, studies the very common situation in which models need to learn from biased and small datasets. “I would challenge you to find me a dataset that isn’t biased,” Saenko said.

For example, Saenko suggested, imagine you have a model that’s been trained using supervised learning with labeled data to recognize pedestrians in a warm climate. Now, try to use that model in New England in winter. Pedestrians are wearing hats and heavy coats. They may be hidden by snow banks. Even if it’s a relatively large dataset, it’s biased towards people in a subset of possible environments.

Saenko and her fellow researchers have primarily applied a technique called adversarial domain alignment to improve classification accuracy when a model is presented with new unlabeled data that is not representative of the original training data. They add a second classifier, called a domain discriminator, that tries to tell the two datasets apart. The model is then trained so that the discriminator cannot, which aligns the two datasets without requiring new labeled data.
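One common way to write this kind of adversarial objective is sketched below. The talk summary does not give the exact loss, so the notation is an assumption made for illustration: $F$ is a feature extractor, $C$ a label classifier, $D$ the domain discriminator, $(x_s, y_s)$ labeled source data, $x_t$ unlabeled target data, $\ell$ a classification loss, and $\lambda$ a weighting term.

```latex
\begin{aligned}
\mathcal{L}_{\text{cls}}(F, C) &= \mathbb{E}_{(x_s, y_s)}\big[\ell\big(C(F(x_s)),\, y_s\big)\big] \\
\mathcal{L}_{\text{dom}}(F, D) &= -\,\mathbb{E}_{x_s}\big[\log D(F(x_s))\big]
                                 - \mathbb{E}_{x_t}\big[\log\big(1 - D(F(x_t))\big)\big] \\
\min_{F,\,C}\ \max_{D}\ \ & \mathcal{L}_{\text{cls}}(F, C) \;-\; \lambda\, \mathcal{L}_{\text{dom}}(F, D)
\end{aligned}
```

In this formulation the discriminator $D$ lowers its own error at telling source from target, while the feature extractor $F$ is pushed to raise that error. Only the source data contributes labels, which matches the point that no new labeled data is needed.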

Programmable network-centric infrastructure for research

The distributed workflows day also featured a talk by Ilya Baldin, Director, Network Research and Infrastructure, at RENCI (Renaissance Computing Institute). The subject was FABRIC: an adaptive programmable research infrastructure for computer science and science applications. FABRIC is intended to enable cutting-edge and exploratory research at scale in networking, cybersecurity, distributed computing and storage systems, machine learning, and science applications.

There are a number of motivations behind FABRIC. In particular, Baldin foresees that changes in the economics of compute and storage allow for the possibility that a future internet might be more stateful. As Baldin put it, “if we had to build a router from scratch today, it wouldn’t look like the routers that we build now.” Add to this the explosion of new types of compute, like GPUs and FPGAs, a new high-speed intelligent network edge, and new classes of distributed applications. FABRIC should provide new ways to link all these things together.

FABRIC launched in 2019 with a $20 million grant from the National Science Foundation. It’s since been expanded worldwide with a sister project called FABRIC Across Borders (FAB), which will link FABRIC’s nationwide infrastructure with nodes in other countries. It’s intended to give researchers a testbed with network-resident capabilities to explore and anticipate how large quantities of data will be handled and shared among collaborators spanning continents.

Finding software bugs more efficiently

The infrastructure software day led off with a talk by Baishakhi Ray, an assistant professor at Columbia University, on using neural networks to make fuzzing more efficient. Fuzzing, a common technique for finding software vulnerabilities, feeds invalid, unexpected, or random inputs to a program to see whether it crashes or otherwise behaves anomalously.

However, the success of fuzzers can depend on a lot of human judgement, because traditional fuzzing techniques on their own can be very inefficient. Take, for example, the evolutionary techniques that allow the fuzzer to use feedback from each test case to learn the format of the input over time. Even with this relatively advanced technique, the fuzzer can still get stuck in fruitless sequences of random mutations.
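The feedback loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any real fuzzer: `target`, `mutate`, and `fuzz` are hypothetical names, the "program" is a three-branch function that reports which branches ran, and the crash is simulated with an exception.

```python
import random


def target(data: bytes) -> set:
    """Hypothetical program under test; returns the IDs of branches it executed."""
    covered = {0}
    if len(data) > 0 and data[0] == ord("B"):
        covered.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            covered.add(2)
            if len(data) > 2 and data[2] == ord("G"):
                raise RuntimeError("simulated crash")
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed_input: bytes, budget: int, rng: random.Random):
    """Evolutionary loop: keep any mutated input that reaches a new branch."""
    corpus = [seed_input]
    seen = set(target(seed_input))
    for _ in range(budget):
        child = mutate(rng.choice(corpus), rng)
        try:
            covered = target(child)
        except RuntimeError:
            return child  # crashing input found
        if not covered <= seen:  # child exercised a branch not seen before
            seen |= covered
            corpus.append(child)
    return None  # budget exhausted without a crash


if __name__ == "__main__":
    print(fuzz(b"AAAA", 200_000, random.Random(0)))
```

Each mutation is still a blind guess, so most of the budget goes to inputs that reach nothing new. That waste is the inefficiency the paragraph above refers to.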

Ray’s research proposes a novel smoothing technique using neural network models that can incrementally learn smooth approximations of a complex, real-world program’s branching behaviors. Evaluations of this technique suggest it not only performs faster than existing fuzzer approaches but can also find bugs that other fuzzers don’t.

How to most efficiently schedule multicore systems?

Finally, a talk by Professor Mor Harchol-Balter and PhD student Benjamin Berg of Carnegie Mellon University also looked at performance, but in the context of scheduling on multicore systems.

On such systems, you can dynamically allocate resources to a given job. But should you give each job a lot of resources so it finishes quickly, or should you be fairer about it and assign fewer resources to more jobs? The answer, it turns out, is that it depends on the nature of a given job. In general, most jobs scale less than linearly, so using four cores rather than one gives you less than a 4x speedup. But give every job the bare minimum and, while that may be efficient, everything may take a long time to complete, including jobs that could otherwise complete quickly.
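A small worked example can make the trade-off concrete. The sketch below is not the researchers' model: it assumes a hypothetical speedup curve of `cores ** p`, where `p = 1.0` means perfect linear scaling and `p < 1.0` means sublinear scaling. It compares two simple static policies for jobs that all arrive at time zero.

```python
def speedup(cores: float, p: float) -> float:
    """Assumed speedup curve: linear when p == 1.0, sublinear when p < 1.0."""
    return cores ** p


def mean_response_one_at_a_time(sizes, cores, p):
    """Give every core to one job at a time, shortest job first."""
    rate = speedup(cores, p)
    clock = 0.0
    total = 0.0
    for size in sorted(sizes):
        clock += size / rate  # this job finishes after all earlier ones
        total += clock
    return total / len(sizes)


def mean_response_equal_split(sizes, cores, p):
    """Split the cores evenly across all jobs for the whole run (no reallocation)."""
    rate = speedup(cores / len(sizes), p)
    return sum(size / rate for size in sizes) / len(sizes)


if __name__ == "__main__":
    jobs = [1, 1, 1, 1]
    for p in (1.0, 0.5):
        print(p,
              mean_response_one_at_a_time(jobs, 4, p),
              mean_response_equal_split(jobs, 4, p))
```

With four identical unit-size jobs on four cores, linear scaling (`p = 1.0`) favors running one job at a time, for a mean response time of 0.625 versus 1.0. With the assumed sublinear curve (`p = 0.5`), the equal split wins at 1.0 versus 1.25. The better policy flips with how well the jobs scale, which is the sense in which the answer depends on the nature of the job.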

Harchol-Balter and Berg’s research is focused on deriving an optimal allocation policy that minimizes mean response time across a set of jobs by balancing the trade-off between granting priority to short jobs and maintaining the overall efficiency of the system.

The recordings of all the sessions are up on research.redhat.com, and we encourage you to give them a view.
