Highlights from the distributed workflows and infrastructure software tracks
Last September, a quartet of virtual Red Hat Research Days dove into kernel/hardware development, distributed workflows, privacy, and infrastructure software. The virtual conferences also covered some of the research featured in Red Hat Research Quarterly, including James Honaker and Mercè Crosas’ work on balancing data sharing and privacy (see issue 2:2, “Voyage into the open dataverse”).
Here, we highlight some of the specific research covered on the distributed workflows and infrastructure software days.
Training AI on small and biased datasets
In an ideal world, machine learning/artificial intelligence (ML/AI) would be able to train using lots of perfectly representative and well-labeled data. In practice, things are rarely that neat and tidy. Kate Saenko, an associate professor at Boston University, studies the very common situation in which models need to learn from biased and small datasets. “I would challenge you to find me a dataset that isn’t biased,” Saenko said.
For example, Saenko suggested, imagine you have a model that’s been trained using supervised learning with labeled data to recognize pedestrians in a warm climate. Now, try to use that model in New England in winter. Pedestrians are wearing hats and heavy coats. They may be hidden by snow banks. Even if it’s a relatively large dataset, it’s biased towards people in a subset of possible environments.
Saenko and her fellow researchers have primarily applied a technique called adversarial domain alignment to improve classification accuracy on new unlabeled data that isn’t representative of the original training set. They add a second classifier, called a domain discriminator, whose job is to tell the two datasets apart; training the feature extractor to fool it aligns the two datasets’ representations without requiring new labeled data.
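To make the idea concrete, here is a minimal sketch of adversarial domain alignment in PyTorch, in the spirit of the widely used gradient-reversal formulation. The layer sizes, module names, and loss weighting are illustrative assumptions, not Saenko’s actual models.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Pass features through unchanged, but flip the gradient sign on the
    backward pass so the feature extractor learns to *confuse* the
    domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Illustrative architecture: 256-d input features, 10 task classes.
feature_extractor = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
label_classifier = nn.Linear(128, 10)        # predicts task labels
domain_discriminator = nn.Linear(128, 2)     # predicts source vs. target

def training_step(x_src, y_src, x_tgt, lam=0.1):
    # Task loss: only the labeled source data supervises the classifier.
    f_src = feature_extractor(x_src)
    task_loss = nn.functional.cross_entropy(label_classifier(f_src), y_src)

    # Domain loss: the discriminator tries to tell source from target,
    # while the reversed gradient pushes the features to become
    # indistinguishable across the two domains.
    f_all = torch.cat([f_src, feature_extractor(x_tgt)])
    d_labels = torch.cat([torch.zeros(len(x_src), dtype=torch.long),
                          torch.ones(len(x_tgt), dtype=torch.long)])
    d_logits = domain_discriminator(GradientReversal.apply(f_all, lam))
    domain_loss = nn.functional.cross_entropy(d_logits, d_labels)

    return task_loss + domain_loss
```

The key point is that only the source data needs labels: the unlabeled target data (say, the snowy New England scenes) participates only through the domain loss.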
Programmable network-centric infrastructure for research
The distributed workflows day also featured a talk by Ilya Baldin, director of network research and infrastructure at RENCI (Renaissance Computing Institute), on FABRIC, an adaptive programmable research infrastructure for computer science and science applications. FABRIC is intended to enable cutting-edge, exploratory research at scale in networking, cybersecurity, distributed computing and storage systems, machine learning, and science applications.
There are a number of motivations behind FABRIC. In particular, Baldin foresees that changes in the economics of compute and storage allow for the possibility that a future internet might be more stateful. As Baldin put it, “if we had to build a router from scratch today, it wouldn’t look like the routers that we build now.” Add to this the explosion of new types of compute, like GPUs and FPGAs, a new high-speed intelligent network edge, and new classes of distributed applications. FABRIC should provide new ways to link all these things together.
FABRIC launched in 2019 with a $20 million grant from the National Science Foundation. It has since expanded worldwide through a sister project, FABRIC Across Borders (FAB), which will link FABRIC’s nationwide infrastructure with nodes in other countries. The goal is to give researchers a testbed with network-resident capabilities to explore and anticipate how large quantities of data will be handled and shared among collaborators spanning continents.
Finding software bugs more efficiently
The infrastructure software day led off with a talk by Baishakhi Ray, an assistant professor at Columbia University, on using neural networks to make fuzzing more efficient. Fuzzing, a common technique for finding software vulnerabilities, tests a program by feeding it invalid, unexpected, or random inputs and watching for crashes or other anomalous behavior.
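As a rough illustration of the basic idea (not any particular fuzzer’s implementation), a bare-bones mutational fuzzer is little more than a loop that perturbs a seed input and watches the target program for crashes. The target command and seed file below are hypothetical.

```python
import random
import subprocess

def mutate(data: bytes, n_flips: int = 8) -> bytes:
    """Randomly overwrite a few bytes of a seed input."""
    buf = bytearray(data)
    for _ in range(n_flips):
        buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

def fuzz(target_cmd, seed: bytes, iterations: int = 10_000):
    """Feed mutated inputs to the target and report any crashes."""
    for i in range(iterations):
        candidate = mutate(seed)
        proc = subprocess.run(target_cmd, input=candidate,
                              capture_output=True, timeout=5)
        if proc.returncode < 0:  # killed by a signal, e.g. SIGSEGV
            print(f"iteration {i}: crash on input {candidate[:32]!r}...")

# Hypothetical usage: fuzz an image parser that reads from stdin.
# fuzz(["./parse_image"], seed=open("sample.png", "rb").read())
```

Because the mutations are blind, a loop like this wastes most of its iterations on inputs the program rejects almost immediately.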
However, a fuzzer’s success can depend on a lot of human judgement, because traditional fuzzing techniques on their own can be very inefficient. Take, for example, the evolutionary techniques that let the fuzzer use feedback from each test case to learn the format of the input over time. Even with this relatively advanced technique, the fuzzer can still get stuck in fruitless sequences of random mutations.
Ray’s research proposes a novel smoothing technique using neural network models that can incrementally learn smooth approximations of a complex, real-world program’s branching behavior. Evaluations suggest this approach not only runs faster than existing fuzzers but also finds bugs that they miss.
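One way to picture the approach, under the assumption that a small network has already been trained to predict which branches an input exercises: because the surrogate is differentiable, its gradients indicate which input bytes are most likely to flip a branch that hasn’t been covered yet. The shapes and helper below are illustrative, not Ray’s actual system.

```python
import torch
import torch.nn as nn

# Surrogate network: maps raw input bytes (here, 1024 of them) to predicted
# coverage over 4096 tracked branches. Assumed already trained on
# (input, coverage) pairs observed during earlier fuzzing runs.
surrogate = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 4096), nn.Sigmoid(),
)

def gradient_guided_mutation(seed_bytes: torch.Tensor, branch_id: int, k: int = 16):
    """Mutate only the k input bytes whose change most affects an
    uncovered branch, according to the surrogate's gradient."""
    x = seed_bytes.clone().float().requires_grad_(True)
    coverage = surrogate(x)
    coverage[branch_id].backward()
    hot_bytes = x.grad.abs().topk(k).indices        # most influential positions
    mutated = seed_bytes.clone()
    mutated[hot_bytes] = torch.randint(0, 256, (k,), dtype=mutated.dtype)
    return mutated
```

Instead of mutating bytes blindly, the fuzzer spends its budget on the byte positions the smooth approximation says matter most for reaching new code.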
How to schedule multicore systems most efficiently?
Finally, a talk by Professor Mor Harchol-Balter and PhD student Benjamin Berg of Carnegie Mellon University also looked at performance, but in the context of scheduling on multicore systems.
On such systems, you can dynamically allocate resources to a given job. But should you give each job a lot of resources so it finishes quickly, or should you be more fair and assign fewer resources to more jobs? The answer, it turns out, depends on the nature of a given job. In general, most jobs scale less than linearly, so using four cores rather than one gives you less than a 4x speedup. But give every job the bare minimum and, while that may be efficient, everything may take a long time to complete, even jobs that could otherwise finish quickly.
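A toy back-of-the-envelope comparison (not Harchol-Balter and Berg’s model) shows how the answer flips with how well jobs scale. Assume two identical jobs, 16 cores, and a hypothetical speedup curve s(k) = k^p:

```python
# Two allocation policies for two identical jobs on a 16-core machine,
# assuming a sublinear speedup curve s(k) = k**p.

def speedup(cores: int, p: float) -> float:
    return cores ** p

def mean_response_time(work: float, total_cores: int, p: float, serial: bool) -> float:
    if serial:
        # Run the jobs one at a time, each with all 16 cores:
        # the second job waits for the first, so it finishes twice as late.
        t = work / speedup(total_cores, p)
        return (t + 2 * t) / 2
    # Split the cores evenly; both jobs run in parallel and finish together.
    return work / speedup(total_cores // 2, p)

for p in (0.5, 0.95):
    one_at_a_time = mean_response_time(100, 16, p, serial=True)
    even_split = mean_response_time(100, 16, p, serial=False)
    print(f"p={p}: one-at-a-time {one_at_a_time:.1f}, even split {even_split:.1f}")
# p=0.5:  one-at-a-time 37.5, even split 35.4  -> fairness wins
# p=0.95: one-at-a-time 10.8, even split 13.9  -> prioritization wins
```

Whether fairness or prioritization minimizes mean response time depends entirely on the speedup curve, which is the trade-off an optimal allocation policy has to balance.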
Harchol-Balter and Berg’s research is focused on deriving an optimal allocation policy that minimizes mean response time across a set of jobs by balancing the trade-off between granting priority to short jobs and maintaining the overall efficiency of the system.
Recordings of all the sessions are available on research.redhat.com, and we encourage you to check them out.