Distributed machine learning training on OpenShift

This is a Bachelor-level project.

Project Thoth is a recommendation engine for Python software stacks. As Python is becoming the de facto language of choice for many data scientists and machine learning engineers, Thoth aggregates information, at several levels, about popular Python machine learning and data science packages such as TensorFlow, PyTorch, and many others. One such aggregation is done in a pipeline capable of installing and testing Python packages on different operating systems, each with a different set of software present (different versions of the Python interpreter, different versions of glibc, …).

The goal of this thesis is to evaluate popular machine learning applications and their deployability on OpenShift.

Thesis overview:

  1. Get familiar with project Thoth and its goals.
  2. Get familiar with OpenShift: how to build and deploy applications on it.
  3. Get familiar with popular AI/ML applications, such as TensorFlow, and try to deploy TensorFlow onto an OpenShift cluster.
  4. Experiment with the different multi-worker training techniques offered by these libraries.
  5. Propose a reference architecture for deploying ML applications for distributed ML training.
  6. Evaluate how the deployment could be adjusted to improve performance.
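
As a starting point for steps 3 and 4, the sketch below shows one way each worker pod could describe the training cluster to TensorFlow. TensorFlow's `tf.distribute.MultiWorkerMirroredStrategy` reads the cluster layout from the `TF_CONFIG` environment variable, so the example only builds that JSON value and needs nothing beyond the Python standard library. The helper name `build_tf_config`, the host names `tf-worker-0.tf-worker` and so on (stable pod DNS names as a StatefulSet with a headless service would provide), and port 2222 are all assumptions made for illustration, not part of the project.

```python
import json


def build_tf_config(worker_hosts, task_index, port=2222):
    """Return a TF_CONFIG JSON string for one worker of a multi-worker job.

    worker_hosts: host names of all workers (hypothetical pod DNS names).
    task_index:   position of the current worker in worker_hosts.
    port:         port each worker listens on (2222 is an arbitrary choice).
    """
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must refer to an entry in worker_hosts")
    # Every worker gets the same cluster description...
    cluster = {"worker": [f"{host}:{port}" for host in worker_hosts]}
    # ...and differs only in which task it is.
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical three-worker deployment; names are illustrative only.
hosts = [f"tf-worker-{i}.tf-worker" for i in range(3)]
print(build_tf_config(hosts, task_index=1))
```

In a deployment, each pod would export this string as `TF_CONFIG` before the training script creates its distribution strategy; how the pod learns its own index (for example, from its host name) is left open here.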
