by Sanjay Arora, data scientist at Red Hat (originally published April 6, 2022)
Data science has exploded in popularity (and sometimes, hype) in recent years. This has led to an increased interest in learning the subject. With so many possible directions, it can be hard to know where to start. This blog post is here to help.
Data science is better considered a loose collection of topics rather than a coherent field. It encompasses topics in many different areas, including
- Data storage (databases and other storage technologies)
- Data engineering (infrastructure and techniques for transforming data at scale)
- Statistics and machine learning
- Data visualization and communication
—and many more.
For a beginner, it is crucial to get a flavor of the subject before diving deep and specializing. For anyone curious about learning data science, here is some informal guidance on ways to get started.
Programming languages for data science
In terms of programming languages, Python is used heavily in commercial industry and in academic computer science departments where a lot of machine learning (ML) research is carried out. The statistical language R is used heavily by groups doing classical statistics, such as medicine/clinical groups or psychology. R has a very rich set of libraries in this arena, while Python is still lacking—although there are packages like statsmodels that implement classical methods.
That said, when it comes to ML, especially deep learning or reinforcement learning, Python dominates. In these settings, Python is almost always used as a prototyping language; the core functionality is implemented in lower-level languages such as C and C++, or in Fortran numerical routines. A practitioner might write a neural network using PyTorch or NumPy, which, in turn, call parallelized single instruction, multiple data (SIMD) operations implemented in those lower-level languages. Julia is an interesting alternative language, but for those starting out, learning Python is highly recommended.
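To make the prototyping point concrete, here is a small illustrative sketch (the array size and the choice of a dot product are arbitrary, not from the article) comparing a pure-Python loop with the equivalent NumPy call, which dispatches to optimized compiled routines:

```python
# Illustrative sketch: the same dot product written as a Python loop
# and as a single NumPy call that runs in optimized, compiled code.
import time
import numpy as np

n = 1_000_000
x = np.random.rand(n)
y = np.random.rand(n)

# Pure-Python loop: every iteration goes through the interpreter.
start = time.perf_counter()
total = 0.0
for i in range(n):
    total += x[i] * y[i]
loop_time = time.perf_counter() - start

# NumPy: one call into a compiled (often SIMD-accelerated) routine.
start = time.perf_counter()
total_np = np.dot(x, y)
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
print("results agree:", np.allclose(total, total_np))
```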
Which aspects of ML would you like to try?
The day-to-day work of data scientists can be vastly different. The situation is analogous to having the title “software engineer.” Person A might be a low-level kernel hacker, while Person B might be writing front-end code in React. It’s the same job title, but with very different skill sets, even though both can write code.
There are a few things you could try to get a flavor of data science:
Machine learning: The classic course used to be Andrew Ng’s Coursera course, which is still a great starting point. The course gives a great high-level survey of the various core techniques in ML. More importantly, it conveys the kind of mathematical and algorithmic thinking one needs for the design and analysis of ML algorithms. While most practitioners will not need to design new algorithms, understanding the principles is crucial to applying and extending ML techniques. A great way to build this understanding is to implement not only the assignments in the course but also each algorithm from scratch in Python. Two recent books by Kevin Murphy are highly recommended for those seeking to go beyond and dive deeper into ML.
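As a sketch of what implementing an algorithm from scratch can look like, here is a minimal NumPy-only linear regression trained with batch gradient descent, one of the first techniques covered in Ng's course; the synthetic data and hyperparameters below are made up for illustration:

```python
# Minimal linear regression via batch gradient descent, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

# Add a bias column so the intercept is learned as just another weight.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.zeros(X_b.shape[1])
lr = 0.1
for _ in range(500):
    preds = X_b @ w
    grad = 2.0 / len(y) * X_b.T @ (preds - y)  # gradient of mean squared error
    w -= lr * grad

print("learned [intercept, slope]:", w)  # should be close to [2, 3]
```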
Data manipulation: A huge part of data science work is getting the data in the right structure and format, as well as exploring and checking ideas in the dataset. R has amazing libraries (dplyr) for this, and in the Python world, pandas essentially replicated these functionalities. Similarly, getting comfortable with any plotting library (e.g., matplotlib, seaborn, plotly) will be very useful. NumPy is another essential Python package: a general principle is to replace as many loops in your code as possible with corresponding highly optimized numpy functions. The best way to learn these skills is to pick a dataset from your job or from Kaggle (see below) and start using pandas and matplotlib to explore patterns and hypotheses.
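A minimal sketch of that exploratory workflow might look like the following; the file name and column names (price, neighborhood, sqft) are hypothetical stand-ins for whatever dataset you pick:

```python
# Illustrative exploration of a tabular dataset with pandas and matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("listings.csv")  # hypothetical file

# Quick structural checks: shape, dtypes, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Group-wise summary: median price by neighborhood.
summary = df.groupby("neighborhood")["price"].median().sort_values()
print(summary)

# A simple scatter plot to eyeball the relationship between two columns.
df.plot.scatter(x="sqft", y="price", alpha=0.3)
plt.show()
```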
Kaggle: Kaggle is a platform for ML competitions and is a great source for well-defined problems on clean datasets. A good way to apply ML/modeling skills is to pick a Kaggle problem. Pick one that has a tabular dataset—rather than images or text—for your first modeling exercise. Building models for a specific task that can be scored is a great way to learn new modeling techniques. A downside of Kaggle is that most real-world problems are not that well defined and don’t require getting an extra 0.001% accuracy. Models on Kaggle tend to be too complicated (an ensemble of 50 models, for example), but even with these caveats, Kaggle is a great way to learn practical modeling.
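For a first tabular exercise, a simple baseline along the following lines is usually enough to get a submission going; the file name and the target column are placeholders for whatever the competition actually provides:

```python
# A simple baseline for a tabular Kaggle-style problem with scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")                                  # placeholder file name
X = df.drop(columns=["target"]).select_dtypes("number")        # keep it simple: numeric features only
y = df["target"]                                               # placeholder target column

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```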
Data pipelines: The workflow of an ML project generally consists of data going through a sequence of transformations. Pipelining infrastructure makes it easy to implement these transformations on distributed hardware in a scalable and reliable way. While data engineers are generally responsible for implementing these pipelines, it’s very useful for data scientists to become conversant with pipelining tools too. A popular open source platform for pipelines is Kubeflow Pipelines. The project Operate First currently hosts a service providing Kubeflow Pipelines, which can be used for experimentation.
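As a rough sketch of what a pipeline definition looks like with the Kubeflow Pipelines SDK (kfp, v2-style lightweight components), consider the toy example below; the step names, base image, and data path are invented for illustration, and real steps would do actual I/O and training:

```python
# A toy two-step pipeline sketched with the Kubeflow Pipelines SDK (kfp v2 style).
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def clean_data(raw_path: str) -> str:
    # Placeholder transformation: a real step would read, clean, and write data.
    return raw_path + ".cleaned"

@dsl.component(base_image="python:3.10")
def train_model(clean_path: str) -> str:
    # Placeholder training step.
    return clean_path + ".model"

@dsl.pipeline(name="toy-ml-pipeline")
def toy_pipeline(raw_path: str = "s3://bucket/data.csv"):  # hypothetical path
    cleaned = clean_data(raw_path=raw_path)
    train_model(clean_path=cleaned.output)

if __name__ == "__main__":
    # Compile to a YAML spec that a Kubeflow Pipelines instance can run.
    compiler.Compiler().compile(toy_pipeline, "toy_pipeline.yaml")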
Domain expertise pays off in data analysis
In almost every scientific field, the role of the data scientist is actually played by a physicist, chemist, psychologist, mathematician (for numerical experiments), or some other domain expert. They have a deep understanding of their field and pick up the necessary techniques to analyze their data. They have a set of questions they want to ask, and they have the knowledge to interpret the results of their models and experiments.
With the increasing popularity of industrial data science and the rise of dedicated data science educational programs, a typical data scientist’s education often lacks domain-specific training. This lack of domain understanding strips away a data scientist’s ability to ask meaningful questions of the data or generate new experiments and hypotheses. The only solutions are either to work with a domain expert or, even better, to start learning the field one is interested in. The latter approach does take a long time but pays rich dividends.
ML specializations
There’s also the option of going deep into the techniques themselves. A big caveat is that some of these are very specialized and generally need a lot of dedicated time. The list below is woefully incomplete and is meant to give a sense of what a few subspecialties of ML involve. Most data scientists will probably never encounter these specializations in their work.
Deep learning: Beyond learning the basics of neural networks and their architectures, deep learning includes learning to devise new ones and understanding the tradeoffs in their design. Diving into deep learning also requires getting comfortable with the tools (e.g., PyTorch, GPU kernels, possibly some C, Julia code) that let one carry out diverse experiments and scale them. There’s also a lot of reading: Papers With Code is a great resource. Note that there are specialized subfields like computer vision, which do a lot more than throw a convolutional neural network at an image.
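A tiny PyTorch experiment, in the spirit of getting comfortable with the tooling, might look like this; the architecture, synthetic data, and hyperparameters are arbitrary illustrations rather than a recommended setup:

```python
# A tiny PyTorch experiment: a two-layer network fit to a toy regression task.
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data: y = sin(x) with a bit of noise.
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print("final training loss:", loss.item())
```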
Reinforcement learning: This is even more specialized than deep learning, but it’s a fast-growing, intellectually rich field. Again, this involves reading and understanding (and implementing) lots of papers, identifying subthreads that one finds interesting, then applying or extending them. Reinforcement learning is generally more mathematical than deep learning; a minimal tabular Q-learning sketch follows the resource list below. A (non-exhaustive) list of books/resources is:
- Reinforcement Learning by Sutton and Barto
- A great online course by Sergey Levine at Berkeley
- A collection of papers by Pieter Abbeel, also at Berkeley
- NeurIPS 2021 Workshop
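As promised above, here is a minimal tabular Q-learning sketch. The five-state chain environment and hyperparameters are invented for illustration; the update rule itself is the standard one described in Sutton and Barto:

```python
# Tabular Q-learning on a toy 5-state chain: move left or right, reward 1 at the right end.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:        # episode ends at the rightmost state
        # Epsilon-greedy action selection.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: bootstrap from the greedy value of the next state.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # the Q-values for "right" should dominate in every state
```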
Graphical models: Another interesting subfield is that of probabilistic graphical models; a tiny Pyro sketch follows the list below. Some resources here are:
- Statistical Rethinking: This is a great (R-based) book
- Pyro: A PyTorch-based library for graphical models
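And here is the tiny Pyro sketch mentioned above: a Beta-Bernoulli coin-flip model fit with stochastic variational inference. The data and variational family are arbitrary illustrative choices:

```python
# Coin-flip model in Pyro: Beta prior over the bias, Bernoulli likelihood,
# fit with stochastic variational inference (SVI).
import torch
import pyro
import pyro.distributions as dist
from torch.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(data):
    p = pyro.sample("p", dist.Beta(1.0, 1.0))          # prior over the coin bias
    with pyro.plate("flips", len(data)):
        pyro.sample("obs", dist.Bernoulli(p), obs=data)

def guide(data):
    # Variational posterior over p, also a Beta, with learnable parameters.
    a = pyro.param("a", torch.tensor(2.0), constraint=constraints.positive)
    b = pyro.param("b", torch.tensor(2.0), constraint=constraints.positive)
    pyro.sample("p", dist.Beta(a, b))

data = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])  # 6 heads, 2 tails
svi = SVI(model, guide, Adam({"lr": 0.02}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step(data)

# For this data the exact posterior is Beta(7, 3); the fitted values should be close.
print("fitted Beta parameters:", pyro.param("a").item(), pyro.param("b").item())
```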
The subfield of optimal statistical decision making (related to reinforcement learning) gives a sense of just how specialized things can get. To learn more, see:
- Optimal Statistical Decisions by DeGroot
- Bandits by Lattimore and Szepesvári
- Reinforcement Learning: Theory and Algorithms by Agarwal, Jiang, Kakade, and Sun
Lastly, a philosophical point: there are two opposing approaches. One is to know which tool to use, pick up a pre-implemented version online, and apply it to one’s problem. This is a very reasonable approach for most practical problems. The other is to deeply understand how and why something works. This approach takes much more time but offers the advantage of modifying or extending the tool to make it more powerful.
The problem with the first approach is that when one doesn’t understand the internals, it’s easy to give up if something doesn’t work. The problem with the second approach is that it is generally much more time consuming (maybe that’s not really a problem) and must be accompanied by application to problems (practical or not) to avoid having just a superficial level of understanding.
My very opinionated advice is to do both. Always apply the techniques to problems. The problems can be artificial, using a synthetically generated dataset, or they can be real. See where they fail and where they succeed. But don’t ignore the math and the fundamentals. The goal is also to understand and not just use, and understanding almost always has some mathematical elements. Initially, it might seem like a foreign language, but eventually it allows one to generate new ideas and see connections that are just hard to see otherwise. Sometimes the mathematics in ML papers can seem gratuitous. Still, even then, it provides a post-hoc justification of observed results and can be used to suggest new extensions of the techniques and new experiments to verify whether the mathematical understanding is correct.
Good luck!