How to do version control in Machine Learning projects

Rishabh Agrahari
6 min read · Jun 4, 2021

Machine learning projects can get messy over time, and when you work as a team, it becomes even more challenging. The one thing that sets machine learning projects apart from generic software development projects is the data.

In machine learning projects, “data” is not just limited to the raw input data; it also includes the processed data, model weights, pipelines, and metrics. Version control with Git is easy when you have your own dedicated Git server, but that’s not always the case. Most of us have to turn to hosted Git servers like GitHub, GitLab, etc., and they have restrictions on the size of the files. Given that data files in ML projects can quickly grow to gigabytes, we end up manually sharing data files with our peers and juggling multiple versions of data files and the code files associated with them; these practices lead to a mess after a while. We could easily avoid this scenario if we could do seamless version control in our machine learning projects, just like we do in a typical software dev project, right? This blog will show how we can do just that using Git and DVC. So, fasten your seatbelts. 🙂

DVC is an open-source Version Control System for Machine Learning Projects. It is designed to handle large files, data sets, machine learning models, and metrics.

We’ll be using Git to push our ML code to a standard Git server (GitHub, in this blog) and DVC to push our project-related data and model weights to a remote data storage server (AWS S3 in this blog; Google Drive and other cloud storage services can be used as well!). Let’s get started then!

Git Repository Setup

First, we will create a repo on GitHub.
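We’ll then work inside a local clone of that repo, since DVC runs on top of Git. A minimal sketch of this step, with a placeholder URL:

    git clone https://github.com/<your-username>/<your-repo>.git   # placeholder URL
    cd <your-repo>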

DVC Setup

There’s a one-time setup that needs to be done before we can start using DVC.

  1. Install DVC with pip install dvc
  2. Initialize your project as a DVC repository with dvc init (see the sketch below)
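Here’s what this one-time setup looks like end to end. One note: since we’ll use S3 as remote storage below, you may need DVC’s s3 extra, which pulls in the AWS dependencies (this depends on your DVC version):

    pip install "dvc[s3]"            # plain pip install dvc works too; the s3 extra adds S3 support
    dvc init                         # creates the .dvc/ directory with DVC's internal config
    git commit -m "Initialize DVC"   # dvc init stages its files with Git, so we just commit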

Once this is done, we need to add cloud storage, which DVC will use to store the data and model weights.

In this blog, I’ll be using AWS S3 as my remote cloud storage. Here’s what you need to do:

  1. Create an S3 Bucket
  2. Make sure you have configured your awscli locally (e.g., by running aws configure).
  3. Run dvc remote add -d myremote s3://my-remote-s3-bucket-URI. This command adds your S3 bucket as a remote cloud storage location, aliased as myremote; the -d flag makes it the default remote (see the sketch below).
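A minimal sketch of the remote setup, with a placeholder bucket name:

    aws s3 mb s3://my-remote-s3-bucket             # create the bucket (or use the AWS console)
    dvc remote add -d myremote s3://my-remote-s3-bucket
    git add .dvc/config                            # the remote configuration lives here
    git commit -m "Configure DVC remote storage"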

Phew! Finally, the one-time setup is done, and we are ready to get started with version control!

The Version Control

Let’s write some ML code. What about MNIST classification using PyTorch? It’ll do the following things (a sketch of the script follows the list):

  1. Download the MNIST data
  2. Train a CNN Model
  3. Save the trained model’s weights.
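Here’s a minimal sketch of what main.py could look like; the network architecture, hyperparameters, and file names are illustrative, not the exact script from this project:

    # main.py — a minimal illustrative sketch (architecture and hyperparameters are placeholders)
    import os

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms


    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3)
            self.conv2 = nn.Conv2d(32, 64, 3)
            self.fc1 = nn.Linear(64 * 12 * 12, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(F.relu(self.conv2(x)), 2)
            x = torch.flatten(x, 1)
            return self.fc2(F.relu(self.fc1(x)))


    def main():
        # 1. Download the MNIST data into data/MNIST
        transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )
        train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
        loader = DataLoader(train_set, batch_size=64, shuffle=True)

        # 2. Train a CNN model (one epoch, for brevity)
        model = Net()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        model.train()
        for images, labels in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()

        # 3. Save the trained model's weights
        os.makedirs("model_weights", exist_ok=True)
        torch.save(model.state_dict(), "model_weights/mnist_cnn.pt")


    if __name__ == "__main__":
        main()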

After we are done, this is what our project structure will look like:
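Roughly like this (the weights file name is illustrative):

    .
    ├── data
    │   └── MNIST            # the downloaded dataset, ~105 MB in total
    ├── main.py              # model training and saving code
    └── model_weights
        └── mnist_cnn.pt     # trained weights (file name illustrative)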

A few things to notice here:

  1. The MNIST data is stored at data/MNIST with an overall size of 105MB
  2. Model training and saving code is written in main.py
  3. Trained model weights are stored in the model_weights folder

Running git status at this point will show data/, main.py, and model_weights/ as untracked.

Now, we’d love to push our model training script, main.py, to GitHub, but not the data and model weights. This is where DVC comes into the picture. We’ll add our data and model weights folders to DVC using the dvc add command, so that DVC tracks any changes made to these folders. This command is similar to the git add command.
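Concretely, for the two folders in our project:

    dvc add data
    dvc add model_weights

Each dvc add call also prints a hint reminding you to git add the metafile it generated.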

What does dvc add do?

  1. Asks DVC to track a file/folder
  2. Puts the added file/folder into .gitignore so that Git will ignore it from then on
  3. Creates a .dvc file for each added file/folder

What are these .dvc files for?

  1. DVC tracks the data using these .dvc files. They contain the necessary meta information (a content hash, size, and path) of the tracked files; see the example below.
  2. DVC pushes/pulls the actual data to/from remote cloud storage using these .dvc files.
  3. Git tracks these .dvc files instead of the actual data, so the .dvc files get pushed to GitHub instead of the actual data.
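For instance, here’s roughly what data.dvc looks like; the hash and sizes below are placeholders, and the exact fields vary slightly across DVC versions:

    $ cat data.dvc
    outs:
    - md5: 0a1b2c3d4e5f67890123456789abcdef.dir   # placeholder content checksum
      size: 110000000                             # illustrative size in bytes
      nfiles: 4                                   # illustrative file count
      path: data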

Ingenious, right? 🙂

Take a look at the output of the git status -u command: these are the files that are going to be tracked by Git. Notice the .dvc/ folder in the root of the repository. It contains all the information about the remote cloud storage and the data files that DVC is tracking. Also, we can see that no actual data is being tracked, just the corresponding .dvc files!

Let’s commit and push the code to GitHub!
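Something along these lines (the commit message and branch name are up to you):

    git add .
    git commit -m "Add MNIST training code; track data and weights with DVC"
    git push origin main    # or master, depending on your default branch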

Check out the updated GitHub repository here!

Let’s push the data to remote cloud storage using the dvc push command:
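Since myremote was set as the default remote, no extra arguments are needed:

    dvc push       # uploads the DVC-tracked data and model weights to the S3 remote
    dvc status -c  # optional: confirm the local cache and the remote are in sync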

This is what our AWS S3 bucket will look like: a bunch of folders with short, seemingly random names. These folders contain our data files, and the names only look random; DVC stores data in a content-addressable layout, deriving folder and file names from content hashes. 🙂

And we are done! The code is on GitHub, and the data is on our remote cloud storage, woohoo! 🥳

What about a code review?

Let’s suppose you ask your peer to review your code and results. Guess how many commands they need to set up the project?

Just two commands (shown in the sketch below), and they are all set for reviewing your code!

  • git clone
  • dvc pull
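A sketch, with a placeholder repository URL:

    git clone https://github.com/<your-username>/<your-repo>.git   # placeholder URL
    cd <your-repo>
    dvc pull    # fetches the DVC-tracked data and model weights from the S3 remote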

Please note that your peer needs access to both the GitHub repo and the AWS S3 bucket set as the DVC remote, and they should have their awscli configured so that they can pull the data from AWS S3 storage.

Conclusion

We saw how we can use Git and DVC to manage our machine learning projects effectively. DVC has many more features, like checking out previous versions of data and local branching, and it supports many more cloud storage platforms, like Google Drive, Azure Storage, etc. Check them out on the DVC Features page. Do also check out the absolutely excellent official DVC tutorials on YouTube.
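As a taste of the time-travel feature: Git versions the small .dvc metafiles, so you can check out an old commit and let DVC restore the matching data (the commit reference is a placeholder):

    git checkout <old-commit-or-tag>   # placeholder — brings back old code and .dvc files
    dvc checkout                       # syncs data/ and model_weights/ to match them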

The End!

About me: Hi, I’m Rishabh. I’m an AI Researcher at Tvarit GmbH, where we are trying to solve complex manufacturing problems using AI. Let’s connect on Twitter @pyags 🙂
