Explained: DevOps vs MLOps
Updated: 3 days ago
If you’ve explored the tech space lately, you’ll have undoubtedly heard the term DevOps being thrown around. If you’ve been reading any data science or machine learning articles, you might have come across the term MLOps as well. Are you unsure what either of these terms means, or how they relate to one another? We’ve got you covered!
In this article, we provide a simple (yet exhaustive) overview of what DevOps and MLOps are. We’ll also go over some commonly encountered problems, as well as their solutions.
What is DevOps?
DevOps is a combination of the words Development and Operations. In order to properly explain it, let’s first describe the problems it attempts to solve.
Say you have a software team working on a web application. This web app has a release cycle of two weeks, meaning that every two weeks the software team will add new features, improve performance or fix nasty little bugs here and there. Here are some of the challenges the team might face in their day-to-day work:
Manually run tests whenever new changes are made;
Deploy these new changes;
Roll back to an earlier version in case a major bug occurs;
Set up monitoring in case the app crashes for no apparent reason;
Provision or modify infrastructure (servers, databases, ...);
Many other examples…
These are just a subset of the problems that might occur and, as you can guess, each one is a time-consuming hassle that can severely hinder the web app’s release cycle.
This is where the concept of DevOps comes in. DevOps is, essentially, a set of practices that aims to ease the development process and make deployments as seamless and hands-off as possible. By automating the time-consuming and repetitive tasks, the development team is left free to focus on their actual product. The following are examples of DevOps practices that aim to achieve just that.
Using a Version Control System (VCS):
Version Control Systems (VCS) allow developers to easily store and collaborate on their code. They also keep a history of changes, providing the ability to roll back to older versions of the code if needed.
Examples of VCS are Git, SVN, Mercurial and Perforce.
Implementing CI/CD pipelines:
CI/CD (Continuous Integration / Continuous Deployment) pipelines are a set of steps that are run whenever changes are made to the codebase. For example, you could have a CI/CD pipeline set up to do the following:
Build / compile the application;
Run any necessary tests;
Deploy the compiled application to production;
Validate that the deployed application is running correctly.
In short, this ensures both the quality and the velocity of going from development to production. With a solid CI/CD pipeline in place, developers can make changes and have them deployed with the press of a button.
Examples of tools to build CI/CD pipelines are GitHub Actions, GitLab CI, Jenkins and CircleCI.
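To make the idea concrete, here is a minimal Python sketch of what a pipeline runner does: it runs the steps listed above in order and stops at the first one that fails. The step functions (build, run_tests, deploy, validate) are hypothetical stand-ins that always succeed; real CI/CD tools are configured in their own formats and run actual build and deployment commands.

```python
from typing import Callable, List, Tuple

# Each step is a (name, function) pair; the function returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step():
            print(f"step '{name}' failed, stopping pipeline")
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for real build, test, deploy and validation commands.
def build() -> bool:
    return True


def run_tests() -> bool:
    return True


def deploy() -> bool:
    return True


def validate() -> bool:
    return True


pipeline: List[Step] = [
    ("build", build),
    ("test", run_tests),
    ("deploy", deploy),
    ("validate", validate),
]

print(run_pipeline(pipeline))
```

Stopping at the first failure is the important design choice here: a failing test run means the deploy step is never reached.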
Infrastructure as code:
Usually, if you want to deploy anything to be used by end users on the web, you’d have to provision the necessary virtual machines, databases and so on. If you’re using a cloud provider (AWS / GCP / Azure), that means having to navigate through their web portal and create these resources manually.
This might be okay if all you’re doing is provisioning one virtual machine but, if you need multiple instances for multiple environments, you’re going to be doing a lot of clicking.
This is where Infrastructure as Code (IaC) comes in. Following the DevOps mantra, IaC aims to automate the provisioning of infrastructure. Instead of using your cloud provider’s portal, you write scripts that define the infrastructure you need, and that infrastructure is provisioned automatically when the scripts are run. This ensures that if you ever need to rebuild your infrastructure, you can do so with the click of a button. Adding a new development environment with the same specs also becomes an easy task.
To recap, IaC automates the provisioning of infrastructure to make it faster, more reliable and less prone to human errors.
Examples of IaC tools are Terraform and AWS CloudFormation.
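The core idea behind these tools can be sketched in a few lines of Python: the infrastructure you want is declared as data, and the tool compares it with what currently exists to work out which actions are needed. This is only an illustration of the concept, not how any particular tool is implemented; the resource names, sizes and the plan function are all assumptions made for the example.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]

# Desired infrastructure, declared as data. Names and sizes are hypothetical.
desired: Resources = {
    "web-server": {"type": "vm", "size": "small"},
    "app-db": {"type": "database", "size": "medium"},
}

# What currently exists. A real tool would read this from the cloud provider.
current: Resources = {
    "web-server": {"type": "vm", "size": "large"},
    "old-cache": {"type": "vm", "size": "small"},
}


def plan(desired: Resources, current: Resources) -> List[str]:
    """Compare desired and current state and list the actions needed."""
    actions: List[str] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(f"create {name}")
        elif desired[name] != current[name]:
            actions.append(f"update {name}")
    for name in sorted(current):
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


print(plan(desired, current))
# ['create app-db', 'update web-server', 'delete old-cache']
```

Because the desired state is just data, adding a second environment with the same specs amounts to reusing the same declaration.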
Monitoring and Observability:
Say your application, which is now running in production, suddenly crashes. To be able to diagnose the problem, developers would have to crawl through various parts of the system to find the failing component. This process is made much easier if there are adequate monitoring and observability tools set in place. These tools collect and store the following information (these are just examples, there’s potentially much more you can collect based on your application):
CPU and RAM usage;
Application-specific metrics (KPIs, number of failed requests over time, etc.);
Disk read / write speeds.
Examples of monitoring and observability tools are Grafana, Elasticsearch, Kibana and Prometheus.
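As a small illustration of one of the metrics mentioned above, the following Python sketch tracks the share of failed requests over the most recent requests and flags when it crosses a threshold. The FailureRateMonitor class, the window size and the threshold are assumptions made for this example; the tools listed above collect and visualise such metrics at a much larger scale.

```python
from collections import deque
from typing import Deque


class FailureRateMonitor:
    """Track the failure rate over the most recent `window` requests."""

    def __init__(self, window: int, threshold: float) -> None:
        # A deque with maxlen discards the oldest result automatically.
        self.results: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, succeeded: bool) -> None:
        self.results.append(succeeded)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate() > self.threshold


# Hypothetical request outcomes: True means the request succeeded.
monitor = FailureRateMonitor(window=5, threshold=0.5)
for ok in [True, True, False, False, False]:
    monitor.record(ok)

print(monitor.failure_rate(), monitor.should_alert())
# 0.6 True
```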
What is MLOps?
MLOps is shorthand for Machine Learning Operations. While DevOps focuses on making the development and deployment of software applications easier, MLOps does the same for machine learning applications.
Now, you might ask yourself, what is the difference between software and machine learning applications? Well, the answer is data.
You see, software applications are developed by writing code. Machine learning applications, however, require both code and data to be developed. This extra requirement adds yet another dimension to the typical problems engineers face.
Of course, the addition of data isn’t the only factor; there are other differences as well. In the next sections, we’re going to explore the problems organizations might face when deploying machine learning models and examples of the types of solutions they might employ.
When describing the problems MLOps tackles, I tend to split the lifecycle of a typical machine learning application into two phases: the Development Phase, where data gathering, data preparation and model training occur, followed by the Production Phase, where infrastructure provisioning, model deployment and monitoring occur.
Both of these phases have their own problems. These include:
Development Phase Problems
Reproducibility issues: machine learning applications depend not only on code but on data as well;
Special infrastructure needs: some machine learning algorithms require GPUs to be trained in a sensible amount of time;
Scaling the development workflow: teams often need to train multiple models at once.
Production Phase Problems
Models created by data scientists usually require an additional API layer so that other systems can communicate with them. This layer needs to be developed by software engineers, resulting in a time-consuming handoff period;
Lack of oversight of what goes in and out of the deployed model;
Decrease in the model’s live performance due to changes in data trends in production;
Repeating the entire process and retraining models whenever new data comes in.
Data + Code Versioning:
As previously mentioned, machine learning applications rely on both code and data. Hence, in order to ensure reproducibility, it would be wise to version both of these.
An example of a tool that helps with this is MLflow. It keeps track of your machine learning experiments and can record which versions of the code and data were used in each one.
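To show why versioning both code and data matters for reproducibility, here is a minimal Python sketch that identifies each training run by a hash of the code and a hash of the data it used. This is not MLflow’s API; the fingerprint and experiment_record functions, and the in-memory code and data, are hypothetical and only illustrate the idea.

```python
import hashlib
import json
from typing import Dict


def fingerprint(content: bytes) -> str:
    """Return a short, stable identifier for some bytes."""
    return hashlib.sha256(content).hexdigest()[:12]


def experiment_record(code: bytes, data: bytes, params: Dict[str, float]) -> Dict[str, object]:
    """Describe one training run by the exact code, data and parameters used."""
    return {
        "code_version": fingerprint(code),
        "data_version": fingerprint(data),
        "params": params,
    }


# Hypothetical in-memory stand-ins for a training script and a dataset file.
code = b"model = train(data, learning_rate)"
data_v1 = b"x,y\n1,2\n2,4\n"
data_v2 = b"x,y\n1,2\n2,4\n3,6\n"

run_a = experiment_record(code, data_v1, {"learning_rate": 0.1})
run_b = experiment_record(code, data_v2, {"learning_rate": 0.1})

print(json.dumps(run_a, indent=2))
# Same code, different data: only the data version differs between runs.
print(run_a["code_version"] == run_b["code_version"])
print(run_a["data_version"] == run_b["data_version"])
```

If two runs give different results, comparing their records shows immediately whether the code, the data or the parameters changed.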
Data Pipelines:
If CI/CD pipelines are meant for code, data pipelines are meant for data. The purpose of these pipelines is to trigger a training process whenever new data comes in. An example data pipeline could do the following:
Fetch the latest data;
Clean and run feature engineering;
Train multiple machine learning models at the same time;
Evaluate the resulting models;
Deploy the best resulting model.
When set up properly, data pipelines are an extremely powerful asset.
Examples of tools to build data pipelines are Kubeflow and Apache Airflow.
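The example steps above can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the data is hard-coded, the two "models" are deliberately trivial, the candidates are trained one after the other rather than in parallel, and "deploying" simply means returning the best model. A real pipeline would be defined in an orchestration tool such as the ones mentioned above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[float, float]
Model = Callable[[float], float]


def fetch_data() -> List[Tuple[float, Optional[float]]]:
    """Hypothetical data source; a real pipeline would query a database or file store."""
    return [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0), (5.0, 10.0)]


def clean(rows: List[Tuple[float, Optional[float]]]) -> List[Row]:
    """Drop rows with a missing target value."""
    return [(x, y) for x, y in rows if y is not None]


def train_mean_model(rows: List[Row]) -> Model:
    """Baseline: always predict the average target seen in training."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def train_ratio_model(rows: List[Row]) -> Model:
    """Fit y = k * x by least squares through the origin."""
    k = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: k * x


def mean_squared_error(model: Model, rows: List[Row]) -> float:
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def run_data_pipeline() -> Tuple[str, Model]:
    rows = clean(fetch_data())                 # fetch and clean
    train, test = rows[:-1], rows[-1:]         # hold out the last row for evaluation
    candidates: Dict[str, Model] = {           # train several candidate models
        "mean": train_mean_model(train),
        "ratio": train_ratio_model(train),
    }
    scores = {name: mean_squared_error(m, test) for name, m in candidates.items()}
    best = min(scores, key=lambda name: scores[name])  # evaluate and pick the best
    return best, candidates[best]              # "deploy" by returning the winner


best_name, best_model = run_data_pipeline()
print(best_name, best_model(6.0))
```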
Monitoring and Observability of ML Metrics:
In addition to monitoring the CPU and RAM usage of the deployed model, it’s also vital to monitor more machine learning specific metrics to keep track of our model’s performance. These might include the model’s accuracy, error rate, precision, recall and data drift.
The tools used here are similar to those suggested in the DevOps section, with further configuration to monitor ML-specific metrics: Grafana, Elasticsearch, Kibana and Prometheus.
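As a rough sketch of what computing such metrics involves, the Python below calculates precision and recall from true and predicted labels, and a very simple data drift score: how far the mean of a live feature has moved from its training mean, measured in training standard deviations. The drift measure, the sample values and the idea of comparing it against a threshold are assumptions made for this example; production systems typically use more robust statistical tests.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def drift_score(training: List[float], live: List[float]) -> float:
    """Shift of the live mean from the training mean, in training standard deviations."""
    spread = pstdev(training)
    if spread == 0:
        return 0.0
    return abs(mean(live) - mean(training)) / spread


# Hypothetical labels collected from a deployed model.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_recall(y_true, y_pred))
# (0.75, 0.75)

# Hypothetical values of one input feature at training time and in production.
training_values = [1.0, 2.0, 3.0, 4.0, 5.0]
live_values = [4.0, 5.0, 6.0, 7.0, 8.0]
print(drift_score(training_values, live_values) > 2.0)
# True
```

A drift score that keeps rising is a signal that the production data no longer resembles the training data and that retraining may be needed.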
Hopefully, by now you have a better picture of what DevOps and MLOps are. It’s best to keep in mind that both of these fields are ever-evolving, with new technologies popping up all the time, especially in MLOps, which brings us to the next section:
Konan: Machine Learning Deployments Made Easy!
Being a Data Science and AI company, we’ve found the need for integrating MLOps in our own ML lifecycle. We’ve created Konan, an MLOps platform focused on deploying, monitoring and maintaining your ML models. Essentially, we’re aiming to make the Production Phase of the machine learning lifecycle as hassle-free as possible. Here’s a sample of what Konan offers:
Automatic provisioning and maintenance of infrastructure;
Automatic API creation with the best security practices;
Monitoring of everything that goes into and out of your model;
Insight over your model’s performance over time;
Retraining your models in a pinch.
Head over to Konan’s page on our website to learn more or sign up for a free trial.