Getting started with data science and machine learning

Aug 14, 2023 | Blog

by Sanjay Arora, data scientist at Red Hat (originally published April 6, 2022)

Data science has exploded in popularity (and sometimes, hype) in recent years. This has led to an increased interest in learning the subject. With so many possible directions, it can be hard to know where to start. This blog post is here to help.

Data science is better considered a loose collection of topics than a coherent field. It encompasses topics in many different areas, including

Data storage (databases, data storage technologies)
Data engineering (infrastructure and techniques for transforming data at scale)
Statistics and machine learning
Data visualization and communication

—and many more.

For a beginner, it is crucial to get a flavor of the subject before diving deep and specializing. For anyone curious about learning data science, here is some informal guidance on ways to get started.

Programming languages for machine learning

In terms of programming languages, Python is used heavily in commercial industry and in academic computer science departments where a lot of machine learning (ML) research is carried out. The statistical language R is used heavily by groups doing classical statistics, such as medicine/clinical groups or psychology. R has a very rich set of libraries in this arena, while Python is still lacking—although there are packages like statsmodels that implement classical methods.

That said, when it comes to ML, especially deep learning or reinforcement learning, Python dominates. In these settings, Python is almost always used as a prototyping language; all the core functionality is implemented in a lower-level language, like C, C++, or numerical routines in Fortran. A practitioner might write a neural network using PyTorch or NumPy, which, in turn, call parallelized single instruction, multiple data (SIMD) operations implemented in lower-level languages. Julia is an interesting alternative language, but for those starting out, learning Python is highly recommended.
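To make this division of labor concrete, here is a minimal sketch, assuming NumPy is installed. The function names dot_pure_python and dot_numpy are made up for illustration and do not come from any library. Both compute the same dot product: the first runs its loop in the Python interpreter, while the second hands the whole loop to NumPy's compiled routines.

```python
import numpy as np


def dot_pure_python(xs, ys):
    """Dot product with an explicit loop run by the Python interpreter."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_numpy(xs, ys):
    """Same computation, delegated to NumPy's compiled implementation."""
    return float(np.dot(xs, ys))


# Hypothetical test data: two random vectors of the same length.
rng = np.random.default_rng(seed=0)
a = rng.standard_normal(10_000)
b = rng.standard_normal(10_000)

# The two results agree up to floating-point rounding.
print(np.isclose(dot_pure_python(a, b), dot_numpy(a, b)))
```

On large arrays the NumPy version is typically much faster, although the exact speedup depends on the machine and on how NumPy was built.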

Which aspects of ML would you like to try?

The day-to-day work of data scientists can be vastly different. The situation is analogous to having the title “software engineer.” Person A might be a low-level kernel hacker, while Person B might be writing front-end code in React. It’s the same job title, but with very different skill sets, even though both can write code.

There are a few things you could try to get a flavor of data science:

Machine learning: The classic course used to be Andrew Ng’s Coursera course, which is still a great starting point. The course gives a great high-level survey of the various core techniques in ML. More importantly, it conveys the kind of mathematical and algorithmic thinking one needs for the design and analysis of ML algorithms. While most practitioners will not need to design new algorithms, understanding the principles is crucial to applying and extending ML techniques. A great way to build this understanding is to implement not only the assignments in the course but also each algorithm from scratch in Python. Two recent books by Kevin Murphy are highly recommended for those seeking to go beyond and dive deeper into ML.

Data manipulation: A huge part of data science work is getting the data in the right structure and format, as well as exploring and checking ideas in the dataset. R has amazing libraries (dplyr) for this, and in the Python world, pandas essentially replicates this functionality. Similarly, getting comfortable with any plotting library (e.g., matplotlib, seaborn, plotly) will be very useful. NumPy is another essential Python package: a general principle is to replace as many loops in your code as possible with the corresponding highly optimized NumPy functions. The best way to learn these skills is to pick a dataset from your job or from Kaggle (see below) and start using pandas and matplotlib to explore patterns and hypotheses.

Kaggle: Kaggle is a platform for ML competitions and is a great source for well-defined problems on clean datasets. A good way to apply ML/modeling skills is to pick a Kaggle problem. Pick one that has a tabular dataset—rather than images or text—for your first modeling exercise. Building models for a specific task that can be scored is a great way to learn new modeling techniques. A downside of Kaggle is that most real-world problems are not that well defined and don’t require getting an extra 0.001% accuracy. Models on Kaggle tend to be too complicated (an ensemble of 50 models, for example), but even with these caveats, Kaggle is a great way to learn practical modeling.

Data pipelines: The workflow of an ML project generally consists of data going through a sequence of transformations. Pipelining infrastructure makes it easy to implement these transformations on distributed hardware in a scalable and reliable way. While data engineers are generally responsible for implementing these pipelines, it’s very useful for data scientists to become conversant with pipelining tools too. A popular open source platform for pipelines is Kubeflow Pipelines. The project Operate First currently hosts a service providing Kubeflow Pipelines, which can be used for experimentation.
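As one illustration of implementing an algorithm from scratch, as suggested in the machine learning item above, here is a minimal sketch of linear regression fitted by batch gradient descent. It assumes NumPy is installed. The function name fit_linear_regression, the learning rate, the step count, and the synthetic data are all choices made for this example, not taken from any course or library.

```python
import numpy as np


def fit_linear_regression(X, y, learning_rate=0.1, n_steps=500):
    """Fit y ~ X @ w + b by batch gradient descent on mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        residual = X @ w + b - y  # prediction error, shape (n_samples,)
        grad_w = 2.0 * X.T @ residual / n_samples
        grad_b = 2.0 * residual.mean()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic data with known parameters, so the fit can be checked.
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((200, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y = X @ true_w + true_b + 0.01 * rng.standard_normal(200)

w, b = fit_linear_regression(X, y)
print(w, b)  # should land close to true_w and true_b
```

Note that the only loop is over gradient steps; the per-sample arithmetic is expressed as array operations, in line with the NumPy principle mentioned under data manipulation.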

Domain expertise pays off in data analysis

In almost every scientific field, the role of the data scientist is actually played by a physicist, chemist, psychologist, mathematician (for numerical experiments), or some other domain expert. They have a deep understanding of their field and pick up the necessary techniques to analyze their data. They have a set of questions they want to ask, and they have the knowledge to interpret the results of their models and experiments.

With the increasing popularity of industrial data science and the rise of dedicated data science educational programs, a typical data scientist’s education often lacks domain-specific training. This lack of domain understanding strips away a data scientist’s ability to ask meaningful questions of the data or generate new experiments and hypotheses. The only solutions are either to work with a domain expert or, even better, to start learning the field one is interested in. The latter approach does take a long time but pays rich dividends.

ML specializations

In many cases, there’s also the option of going deep into the techniques themselves. A big caveat is that some of these are very specialized and generally need a lot of dedicated time. The list below is woefully incomplete and is meant to give a sense of what a few subspecialties of ML involve. Most data scientists will probably never encounter these specializations in their work.

Deep learning: Beyond learning the basics of neural networks and their architectures, deep learning includes learning to devise new architectures and understanding the tradeoffs in their design. Diving into deep learning also requires getting comfortable with the tools (e.g., PyTorch, GPU kernels, possibly some C or Julia code) that let one carry out diverse experiments and scale them. There’s also a lot of reading: Papers With Code is a great resource. Note that there are specialized subfields like computer vision, which do a lot more than throw a convolutional neural network at an image.

Reinforcement learning: This is even more specialized than deep learning, but it’s a fast-growing, intellectually rich field. Again, this involves reading and understanding (and implementing) lots of papers, identifying subthreads that one finds interesting, then applying or extending them. Reinforcement learning is generally more mathematical than deep learning. A (non-exhaustive) list of books/resources is:

Reinforcement Learning by Sutton and Barto
A great online course by Sergey Levine at Berkeley
A collection of papers by Pieter Abbeel, also at Berkeley
NeurIPS 2021 Workshop

Graphical models: Another interesting subfield is that of probabilistic graphical models. Some resources here are:

Statistical Rethinking: This is a great (R-based) book
Pyro: A PyTorch-based library for graphical models

The subfield of optimal statistical decision making (related to reinforcement learning) provides a sense of just how specialized things can get. To learn more, see:

Optimal Statistical Decisions by DeGroot
Bandits by Lattimore and Szepesvári
Reinforcement Learning: Theory and Algorithms by Agarwal, Jiang, Kakade, and Sun

Lastly, a philosophical point: there are two opposing approaches to working with ML tools. One is to know which tool to use, pick up a pre-implemented version online, and apply it to one’s problem. This is a very reasonable approach for most practical problems. The other is to deeply understand how and why something works. This approach takes much more time but offers the advantage of modifying or extending the tool to make it more powerful.

The problem with the first approach is that when one doesn’t understand the internals, it’s easy to give up if something doesn’t work. The problem with the second approach is that it is generally much more time consuming (maybe that’s not really a problem) and must be accompanied by application to problems (practical or not) to avoid having just a superficial level of understanding.

My very opinionated advice is to do both. Always apply the techniques to problems. The problems can be artificial, using a synthetically generated dataset, or they can be real. See where they fail and where they succeed. But don’t ignore the math and the fundamentals. The goal is also to understand and not just use, and understanding almost always has some mathematical elements. Initially, it might seem like a foreign language, but eventually it allows one to generate new ideas and see connections that are just hard to see otherwise. Sometimes the mathematics in ML papers can seem gratuitous. Still, even then, it provides a post-hoc justification of observed results and can be used to suggest new extensions of the techniques and new experiments to verify whether the mathematical understanding is correct.
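One way to try this on an artificial problem is sketched below, assuming NumPy is installed. The two synthetic targets, the noise level, and the helper name straight_line_mse are all invented for this illustration. The same straight-line model is fitted to data it can represent and to data it cannot, which makes the success and the failure easy to see.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = np.linspace(-3.0, 3.0, 200)
noise = 0.1 * rng.standard_normal(x.size)

# Two synthetic targets: one the model can represent, one it cannot.
y_linear = 2.0 * x + 1.0 + noise
y_quadratic = x**2 + noise


def straight_line_mse(x, y):
    """Fit a degree-1 polynomial by least squares; return its mean squared error."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(np.mean((slope * x + intercept - y) ** 2))


mse_linear = straight_line_mse(x, y_linear)
mse_quadratic = straight_line_mse(x, y_quadratic)

# The error on the linear target stays near the noise level, while the
# error on the quadratic target is far larger.
print(mse_linear, mse_quadratic)
```

Because the data-generating process is known, the gap between the two errors can be traced to a specific cause, here the missing quadratic term, rather than guessed at.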

Good luck!
