Red Hat Research Quarterly

Managing large-scale systems

About the author

Hugh Brock

Hugh Brock is the Research Director for Red Hat, coordinating Red Hat research and collaboration with universities, governments, and industry worldwide. A Red Hatter since 2002, Hugh brings intimate knowledge of the complex relationship between upstream projects and shippable products to the task of finding research to bring into the open source world.

I have been spending a lot of time lately thinking about all the hard problems involved in managing large-scale systems. Why? Well, it turns out to be a really important topic for Red Hat Research and for the Red Hat engineering community that we hope to serve. If we are correct that operating large-scale systems will necessarily be the domain of “expert systems” with AI, then we need to understand exactly what we mean by “operating,” at a minimum.

I tend to approach these kinds of issues from a typical engineering standpoint. How can I construct the “plumbing” that allows me to get decent data out of a system, in the form of logs, events, metering, and so on? And how can I then add the appropriate controls that let someone or something in possession of that decent data do something useful with it? Unfortunately, as hard as this problem is, it turns out to be just the tip of the iceberg. Sanjay Arora’s interview with computer vision expert Kate Saenko, our cover story for this issue, focuses on the difficulty of training models, neural networks, and the like so that they are generalizable and not “biased.” Biased, in this sense, means that a model is unable to tell that an orange hanging from a tree in sunlight is the same object as one sitting in a bowl of fruit in candlelight. This lack of generality also affects the AI we train to control systems. It will be very interesting to see what needs to happen over time before we can really trust a robot to know which tuning knob to turn to keep a mission-critical compute cluster running.

A related problem with large-scale systems arises simply because of the quantity of data they generate and the expense of moving all those bits around. For any reasonably large system, some degree of processing will need to take place close to where the data is collected, so that a smaller amount can be sent on to a central processor. Red Hatter Rui Vieira’s article on using Bayesian inference on streaming data is a very deep look at the different methods available to approximate and reduce a very large data flow. I hope to see applications of his work soon.

In addition to training models, we spend a lot of time in this issue on the different ways we train human beings. Check out Tomáš Effenberger’s piece on using microworlds and puzzles to teach kids programming. It’s absolutely fascinating (and almost certainly more effective than the Fortran books I read at age 12). We don’t stop with kids, either. Petr Viktorin writes in this issue about establishing a Python training program for adult women. Through the program he developed, Petr helped a lot of people understand programming, and in return learned a lot from them about agency and motivation. Like children, adults have lots of different reasons to learn. Fortunately, both children and adults learn better than machines, at least for now.
