How to do version control in Machine Learning projects
Machine learning projects can get messy over time, and when you work as a team, it becomes even more challenging. The one thing that sets machine learning projects apart from generic software development environments is the data.
In machine learning projects, “data” is not limited to the raw input data; it also includes the processed data, model weights, pipelines, and metrics. Version control with Git is easy when you have your own dedicated Git server, but that’s not always the case. Most of us rely on hosted Git servers like GitHub or GitLab, which restrict the size of the files you can push. Given that data files in ML projects can quickly grow to gigabytes, we end up manually sharing data files with our peers and juggling multiple versions of data files and the code associated with them; these practices lead to a mess after a while. We could easily avoid this if we could do seamless version control in our machine learning projects, just like in a typical software dev project, right? This blog will show how to do exactly that using Git and DVC. So, buckle up. 🙂
DVC is an open-source Version Control System for Machine Learning Projects. It is designed to handle large files, data sets, machine learning models, and metrics.
We’ll be using Git to push our ML code to a standard Git server (GitHub, in this blog) and DVC to push our project-related data and model weights to remote data storage (AWS S3 in this blog; Google Drive and other cloud storage services can be used as well!). Let’s get started!
Git Repository Setup
First, we will create a repo on GitHub.
DVC Setup
There’s a one-time setup that needs to be done before we can start using DVC.
- Install DVC with `pip install dvc`
- Initialize your project as a DVC repository with `dvc init`
Once this is done, we need to add cloud storage, which DVC will use to store the data and model weights.
In this blog, I’ll be using AWS S3 as my remote cloud storage. Here’s what you need to do:
- Create an S3 Bucket
- Make sure you have configured your awscli locally.
- Run `dvc remote add -d myremote s3://my-remote-s3-bucket-URI`. This command adds your S3 bucket as the default remote storage location, aliased as `myremote`.
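Under the hood, this command writes the remote’s details into the `.dvc/config` file in your repo, which ends up looking roughly like this (your bucket URI will differ):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-remote-s3-bucket-URI
```

Since this file lives inside the repo, the remote configuration is shared with everyone who clones it.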
Phew! Finally, the one-time setup is done, and we are ready to get started with version control!
The Version Control
Let’s write some ML code. How about MNIST classification using PyTorch? It’ll do the following things:
- Download the MNIST data
- Train a CNN Model
- Save the trained model’s weights.
After we are done, this is what our project structure will look like:
A few things to notice here:
- The MNIST data is stored at `data/MNIST`, with an overall size of 105MB
- The model training and saving code is written in `main.py`
- The trained model weights are stored in the `model_weights` folder
Running `git status` will look something like this:
Now, we’d love to push our model training script, `main.py`, to GitHub, but not the data and model weights. This is where DVC comes into the picture. We’ll add our data and model weights folders to DVC, to track any changes made to them, using the `dvc add` command. This command is similar to the `git add` command.
What does `dvc add` do?
- It asks DVC to track a file/folder
- It adds the file/folder to `.gitignore` so that Git will ignore it from then on
- It creates a `.dvc` file for each added file/folder
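For example, after running `dvc add data/MNIST`, DVC creates a small `data/MNIST.dvc` file along these lines (the hash below is a made-up placeholder, and the exact fields vary slightly between DVC versions):

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  path: MNIST
```

This tiny text file is what Git ends up tracking in place of the 105MB data folder.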
What are these `.dvc` files for?
- DVC tracks the data using these `.dvc` files. They contain the necessary meta information about the tracked files.
- DVC pushes/pulls the actual data to/from remote cloud storage using these `.dvc` files.
- Git tracks these `.dvc` files instead of the actual data, so the `.dvc` files will be pushed to GitHub instead of the actual data.
Ingenious, right? 🙂
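The key idea behind these `.dvc` files is content addressing: DVC identifies each version of a file by a hash of its bytes, so the data itself can live anywhere while Git only needs to track the small hash. A minimal sketch of that idea in plain Python (this illustrates the concept only, not DVC’s actual implementation):

```python
import hashlib


def file_md5(path):
    """Hash a file's contents in chunks, the way DVC fingerprints tracked data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Any edit to the data produces a different hash, i.e. a new "version".
with open("sample.bin", "wb") as f:
    f.write(b"version 1 of the data")
v1 = file_md5("sample.bin")

with open("sample.bin", "wb") as f:
    f.write(b"version 2 of the data")
v2 = file_md5("sample.bin")

print(v1 != v2)  # True: the data changed, so its fingerprint changed
```

DVC stores the tracked data in its cache (and in the remote) under paths derived from these hashes, which is also why the folder names in the storage bucket look random.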
Take a look at the output of the `git status -u` command. These are the files that Git is going to track. Notice the `.dvc/` folder in the root of the repository. It contains all the information about the remote cloud storage and the data files DVC is tracking. Also, we can see that no actual data is being tracked, just the corresponding `.dvc` files!
Let’s commit and push the code to GitHub!
Check out the updated GitHub repository here!
Let’s push the data to remote cloud storage using the `dvc push` command:
This is what our AWS S3 bucket will look like:
These folders contain our data files; the names only look random because DVC stores files in a content-addressed layout, organized by their hashes. 🙂
And we are done! The code is on GitHub, and the data is on our remote cloud storage, woohoo! 🥳
What about a code review?
Let’s suppose you ask your peer to review your code and results. Guess how many commands they need to set up the project?
Just two commands, and they are all set for reviewing your code!
- `git clone`
- `dvc pull`
Please note that your peer needs access to both the GitHub repo and the AWS S3 bucket set as the DVC remote storage, and they should have their awscli configured so that they can pull the data from S3.
Conclusion
We saw how to use Git and DVC to manage our machine learning projects effectively. DVC has many more features, like checking out previous versions of data, local branching, and more, and it supports many other cloud storage platforms, such as Google Drive and Azure Storage. Check them out on the DVC Features page, and do watch the excellent official DVC tutorials on YouTube.
The End!
About me: Hi, I’m Rishabh. I’m an AI Researcher at Tvarit GmbH, where we are trying to solve complex manufacturing problems using AI. Let’s connect on Twitter @pyags 🙂