Getting started with data science and machine learning

Apr 6, 2022 | Blog

by Sanjay Arora, data scientist at Red Hat

Data science has exploded in popularity (and sometimes, hype) in recent years. This has led to an increased interest in learning the subject. With so many possible directions, it can be hard to know where to start. This blog post is here to help.

Data science is better considered a loose collection of topics rather than a coherent field. It encompasses topics in many different areas, including

  • Data storage (databases, data storage technologies)
  • Data engineering (infrastructure and techniques for transforming data at scale)
  • Statistics and machine learning
  • Data visualization and communication

—and many more. 

For a beginner, it is crucial to get a flavor of the subject before diving deep and specializing. For anyone curious about learning data science, here is some informal guidance on ways to get started.

Programming languages for machine learning

In terms of programming languages, Python is used heavily both in commercial industry and in the academic computer science departments where much machine learning (ML) research is carried out. The statistical language R is used heavily by groups doing classical statistics, such as clinical research or psychology groups. R has a very rich set of libraries in this arena, while Python is still catching up, although packages like statsmodels implement many classical methods.

That said, when it comes to ML, especially deep learning or reinforcement learning, Python dominates. In these settings, Python is almost always used as a prototyping language; the core functionality is implemented in lower-level languages like C and C++, or in numerical Fortran routines. A practitioner might write a neural network using PyTorch or NumPy, which, in turn, call parallelized single instruction, multiple data (SIMD) operations implemented in those lower-level languages. Julia is an interesting alternative, but for those starting out, learning Python is highly recommended.

Which aspects of ML would you like to try?

The day-to-day work of data scientists can be vastly different. The situation is analogous to having the title “software engineer.” Person A might be a low-level kernel hacker, while Person B might be writing front-end code in React. It’s the same job title, but with very different skill sets, even though both can write code.

There are a few things you could try to get a flavor of data science:

Machine learning: The classic starting point is Andrew Ng’s Coursera course, and it remains a great one. The course gives a high-level survey of the core techniques in ML. More importantly, it conveys the kind of mathematical and algorithmic thinking one needs for the design and analysis of ML algorithms. While most practitioners will never need to design new algorithms, understanding the principles is crucial to applying and extending ML techniques. A great way to build this understanding is to implement not only the course assignments but also each algorithm from scratch in Python. For those seeking to dive deeper into ML, two recent books by Kevin Murphy are highly recommended.
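To make “from scratch” concrete, here is a minimal sketch of fitting a line by gradient descent in plain Python. The synthetic data and all hyperparameters are chosen purely for illustration:

```python
import random

random.seed(0)

# Synthetic data: y = 3x + 1 plus a little Gaussian noise.
xs = [i / 100 for i in range(100)]
ys = [3 * x + 1 + random.gauss(0, 0.1) for x in xs]

# Fit y = w*x + b by minimizing mean squared error with gradient descent.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
# w and b should now be close to the true slope 3 and intercept 1.
```

Doing this by hand, before using a library, forces you to confront the details (learning rate, convergence, loss surface) that course lectures can only describe.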

Data manipulation: A huge part of data science work is getting the data into the right structure and format, as well as exploring and checking ideas in the dataset. R has amazing libraries (dplyr) for this, and in the Python world, pandas essentially replicated this functionality. Similarly, getting comfortable with any plotting library (e.g., matplotlib, seaborn, plotly) will be very useful. NumPy is another essential Python package: a general principle is to replace as many loops in your code as possible with the corresponding highly optimized NumPy functions. The best way to learn these skills is to pick a dataset from your job or from Kaggle (see below) and start using pandas and matplotlib to explore patterns and hypotheses.
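As an illustration of that loop-replacement principle, here is a sketch comparing a Python-level loop with the equivalent single NumPy call, which runs in optimized compiled code (array sizes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000)
b = rng.random(10_000)

# Loop version: Python-level iteration, one element at a time.
def dot_loop(x, y):
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total

# Vectorized version: one call into NumPy's compiled routines.
def dot_vec(x, y):
    return float(x @ y)

loop_result = dot_loop(a, b)
vec_result = dot_vec(a, b)  # same answer, typically orders of magnitude faster
```

The two results agree to floating-point precision; the difference is that the vectorized version spends almost no time in the Python interpreter.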

Kaggle: Kaggle is a platform for ML competitions and is a great source of well-defined problems on clean datasets. A good way to apply ML/modeling skills is to pick a Kaggle problem. Pick one that has a tabular dataset—rather than images or text—for your first modeling exercise. Building models for a specific task that can be scored is a great way to learn new modeling techniques. A downside of Kaggle is that most real-world problems are not that well defined and don’t hinge on squeezing out an extra 0.001% of accuracy. Models on Kaggle tend to be too complicated (an ensemble of 50 models, for example), but even with these caveats, Kaggle is a great way to learn practical modeling.
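Even before reaching for a modeling library, the essential loop of scoring a model on held-out data can be sketched in plain Python. The dataset and the two baseline “models” below are purely hypothetical:

```python
import random

random.seed(42)

# Hypothetical tabular dataset: one numeric feature, binary label, 10% label noise.
data = []
for _ in range(1000):
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# Hold out 20% for validation -- the "scoring" step any Kaggle-style task needs.
random.shuffle(data)
split = int(0.8 * len(data))
train, valid = data[:split], data[split:]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Baseline 1: always predict the majority class seen in training.
majority = round(sum(y for _, y in train) / len(train))
baseline_acc = accuracy(lambda x: majority, valid)

# Baseline 2: a one-feature threshold "model" -- already far better here.
threshold_acc = accuracy(lambda x: int(x > 0.5), valid)
```

The habit worth building is exactly this: always hold out data, always score against a dumb baseline, and only then decide whether a fancier model is earning its complexity.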

Data pipelines: The workflow of an ML project generally consists of data going through a sequence of transformations. Pipelining infrastructure makes it easy to implement these transformations on distributed hardware in a scalable and reliable way. While data engineers are generally responsible for implementing these pipelines, it’s very useful for data scientists to become conversant with pipelining tools too. A popular open source platform for pipelines is Kubeflow Pipelines. The project Operate First currently hosts a service providing Kubeflow Pipelines, which can be used for experimentation. 
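Stripped of scheduling and distribution, the core idea of a pipeline is just data flowing through an ordered list of transformations. A minimal framework-free sketch (the record format and step names are hypothetical; tools like Kubeflow Pipelines add scaling, scheduling, and reproducibility on top of this idea):

```python
# Each step takes a list of records and returns a transformed list.

def clean(records):
    # Drop records with missing values.
    return [r for r in records if r["value"] is not None]

def normalize(records):
    # Rescale values into [0, 1] relative to the maximum.
    top = max(r["value"] for r in records)
    return [{**r, "value": r["value"] / top} for r in records]

def run_pipeline(records, steps):
    # Apply each transformation in order.
    for step in steps:
        records = step(records)
    return records

raw = [{"value": 10}, {"value": None}, {"value": 40}]
result = run_pipeline(raw, [clean, normalize])
```

Pipeline platforms wrap each step in a container, track its inputs and outputs, and rerun only what changed, but the mental model remains this chain of functions.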

Domain expertise pays off in data analysis

In almost every scientific field, the role of the data scientist is actually played by a physicist, chemist, psychologist, mathematician (for numerical experiments), or some other domain expert. They have a deep understanding of their field and pick up the necessary techniques to analyze their data. They have a set of questions they want to ask, and they have the knowledge to interpret the results of their models and experiments.

With the increasing popularity of industrial data science and the rise of dedicated data science educational programs, a typical data scientist’s training lacks domain-specific training. This lack of domain understanding strips away a data scientist’s ability to ask meaningful questions of the data or generate new experiments and hypotheses. The only solutions are either to work with a domain expert or, even better, to start learning the field one is interested in. The latter approach does take a long time but pays rich dividends.

ML specializations 

In many cases, there’s also the option of going deep into the techniques themselves. A big caveat is that some of these areas are very specialized and generally demand a lot of dedicated time. The list below is woefully incomplete and is meant to give a sense of what a few subspecialties of ML involve. Most data scientists will probably never encounter these specializations in their work.

Deep learning: Beyond learning the basics of neural networks and their architectures, deep learning includes learning to devise new ones and understanding the tradeoffs in their design. Diving into deep learning also requires getting comfortable with the tools (e.g., PyTorch, GPU kernels, possibly some C or Julia code) that let one carry out diverse experiments and scale them. There’s also a lot of reading: Papers With Code is a great resource. Note that there are specialized subfields like computer vision, which do a lot more than throw a convolutional neural network at an image.
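One foundational exercise in this area is verifying backpropagation by hand. The sketch below builds a toy network with a single ReLU hidden unit and checks an analytic gradient against a finite-difference estimate; all numeric values are arbitrary illustrations:

```python
def relu(z):
    return max(z, 0.0)

def loss(w, b, v, x, t):
    # Tiny network: one input x, one hidden ReLU unit, one output, squared loss.
    return (v * relu(w * x + b) - t) ** 2

def grads(w, b, v, x, t):
    # Backpropagation by hand: apply the chain rule step by step.
    z = w * x + b
    h = relu(z)
    y = v * h
    dy = 2 * (y - t)                        # dL/dy
    dv = dy * h                             # dL/dv
    dz = dy * v * (1.0 if z > 0 else 0.0)   # dL/dz through the ReLU
    return dz * x, dz, dv                   # dL/dw, dL/db, dL/dv

# Check the analytic gradient against a central finite difference.
w, b, v, x, t = 0.8, 0.1, 1.5, 0.5, 1.0
eps = 1e-6
dw, db, dv = grads(w, b, v, x, t)
dw_num = (loss(w + eps, b, v, x, t) - loss(w - eps, b, v, x, t)) / (2 * eps)
# dw and dw_num agree; frameworks like PyTorch automate exactly this bookkeeping.
```

Gradient checking like this is also how practitioners debug hand-written layers before trusting them at scale.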

Reinforcement learning: This is even more specialized than deep learning, but it’s a fast-growing, intellectually rich field. Again, this involves reading and understanding (and implementing) lots of papers, identifying subthreads that one finds interesting, then applying or extending them. Reinforcement learning is generally more mathematical than deep learning.
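For a first taste of the reinforcement learning loop of acting, observing rewards, and updating estimates, a multi-armed bandit with an epsilon-greedy policy is about as small as it gets (the payout probabilities below are made up):

```python
import random

random.seed(0)

# Three slot machines with hidden payout probabilities; arm 2 is the best.
true_probs = [0.2, 0.5, 0.8]

counts = [0, 0, 0]        # pulls per arm
values = [0.0, 0.0, 0.0]  # running mean reward per arm
epsilon = 0.1

for _ in range(5000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        arm = random.randrange(3)
    else:
        arm = values.index(max(values))
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
# The agent ends up pulling arm 2 most often, with values[2] near 0.8.
```

Full reinforcement learning adds states and long-horizon credit assignment on top of this explore/exploit tension, which is where the mathematics deepens quickly.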

Graphical models: Another interesting subfield is that of probabilistic graphical models, which use graphs to encode dependence structure among random variables and support principled probabilistic inference.
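To give a flavor, here is exact inference by enumeration on a small rain/sprinkler/wet-grass network; the probability tables are hypothetical textbook-style numbers, not from any dataset:

```python
from itertools import product

# P(Rain), P(Sprinkler | Rain), and P(WetGrass | Sprinkler, Rain).
p_rain = 0.2
p_sprinkler = {True: 0.01, False: 0.4}
p_wet = {
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(r, s, w):
    # Joint probability factorizes along the graph structure.
    pr = p_rain if r else 1 - p_rain
    ps = p_sprinkler[r] if s else 1 - p_sprinkler[r]
    pw = p_wet[(s, r)] if w else 1 - p_wet[(s, r)]
    return pr * ps * pw

# P(Rain | WetGrass) by summing the joint over the hidden sprinkler variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
p_rain_given_wet = num / den  # roughly 0.36 with these tables
```

The subfield is largely about doing this kind of inference efficiently when brute-force enumeration over all variable assignments is no longer feasible.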

The subfield of optimal statistical decision making (related to reinforcement learning) gives a sense of just how specialized things can get.

Lastly, a philosophical point: there are two opposing approaches. One is to know which tool to use, pick up a pre-implemented version online, and apply it to one’s problem. This is a very reasonable approach for most practical problems. The other is to deeply understand how and why something works. This approach takes much more time but offers the advantage of modifying or extending the tool to make it more powerful. 

The problem with the first approach is that when one doesn’t understand the internals, it’s easy to give up if something doesn’t work. The problem with the second approach is that it is generally much more time consuming (maybe that’s not really a problem) and must be accompanied by application to problems (practical or not) to avoid having just a superficial level of understanding.

My very opinionated advice is to do both. Always apply the techniques to problems. The problems can be artificial, using a synthetically generated dataset, or they can be real. See where they fail and where they succeed. But don’t ignore the math and the fundamentals. The goal is also to understand and not just use, and understanding almost always has some mathematical elements. Initially, it might seem like a foreign language, but eventually it allows one to generate new ideas and see connections that are just hard to see otherwise. Sometimes the mathematics in ML papers can seem gratuitous. Still, even then, it provides a post-hoc justification of observed results and can be used to suggest new extensions of the techniques and new experiments to verify whether the mathematical understanding is correct.

Good luck!
