Implementing DenseNet on MURA using PyTorch

Rishabh Agrahari
Feb 2, 2018

Last December, the Stanford ML Group released MURA, a large dataset of musculoskeletal radiographs containing 40,895 images from 14,982 studies, where each study is manually labeled by radiologists as either normal or abnormal. It is one of the largest datasets of its kind. The group also developed a 169-layer dense convolutional neural network to detect and localize abnormalities. The model achieved performance comparable to board-certified radiologists.

Read more about the MURA project here, and see the research paper: MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs. The paper gives insight into the model architecture, optimization algorithm, learning rate and its performance on the various study types. They will be releasing the code this month.

Radiographs and their corresponding labels (1 = abnormal, 0 = normal)

In this post I’m going to reproduce the model described in the MURA paper using PyTorch. The code is hosted on GitHub here.

A DenseNet connects each layer to every other layer in a feed-forward fashion. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
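To make the dense connectivity concrete, here is a minimal sketch of a single dense block. It is not the DenseNet-169 used in the paper: TinyDenseBlock and its parameters (in_channels, growth_rate, num_layers) are illustrative names, and the real architecture adds bottleneck and transition layers on top of this idea.

```python
import torch
import torch.nn as nn


class TinyDenseBlock(nn.Module):
    """Illustrative dense block: every layer sees all earlier feature maps."""

    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i receives the block input plus the outputs of layers 0..i-1.
            channels_in = in_channels + i * growth_rate
            self.layers.append(
                nn.Sequential(
                    nn.BatchNorm2d(channels_in),
                    nn.ReLU(),
                    nn.Conv2d(channels_in, growth_rate, kernel_size=3, padding=1, bias=False),
                )
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # Concatenate everything produced so far along the channel axis.
            new_features = layer(torch.cat(features, dim=1))
            features.append(new_features)
        return torch.cat(features, dim=1)


block = TinyDenseBlock(in_channels=16, growth_rate=8, num_layers=3)
out = block(torch.randn(2, 16, 32, 32))
print(out.shape)  # channels: 16 + 3 * 8 = 40
```

Each layer only adds growth_rate new channels, which is why reusing earlier feature maps keeps the parameter count low.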

Exploratory Data Analysis

MURA is a dataset of musculoskeletal radiographs consisting of 14,982 studies from 12,251 patients, with a total of 40,895 multi-view radiographic images. Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder and wrist.

The MURA dataset comes with train, valid and test folders containing the corresponding splits; train.csv and valid.csv contain the paths of the radiographic images and their labels. Each image is labeled 1 (abnormal) or 0 (normal) based on whether its corresponding study is positive or negative, respectively. These radiographic images are sometimes also referred to as views.

Structure of the train and valid sets:

  • The train set consists of seven study types, namely: XR_ELBOW XR_FINGER XR_FOREARM XR_HAND XR_HUMERUS XR_SHOULDER XR_WRIST
  • Each study type contains several folders named like: patient12104 patient12110 patient12116 patient12122 patient12128 ...
  • These folders are named after patient ids, and each of them contains one or more studies, named like: study1_negative study2_negative study3_positive ...
  • Each of these studies contains one or more radiographs (views or images), named like: image1.png image2.png ...
  • Each view (image) is RGB with pixel values in [0, 255], and image dimensions vary.

All of the above points also hold for the test set, except the third: there, the study folders are named like: study1 study2 ...
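The folder layout above means every image path already encodes the study type, patient, study and label. A small sketch of how that could be parsed follows; parse_view_path and ViewInfo are hypothetical helpers (not from the post's repository), and the example path is made up to match the naming scheme described above.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ViewInfo(NamedTuple):
    study_type: str       # e.g. "XR_WRIST"
    patient: str          # e.g. "patient12104"
    study: str            # e.g. "study1_negative" (or "study1" in the test set)
    label: Optional[int]  # 1 = abnormal, 0 = normal, None if the folder has no suffix
    image: str            # e.g. "image1.png"


def parse_view_path(path: str) -> ViewInfo:
    """Split .../<study_type>/<patient>/<study>/<image> into its parts."""
    study_type, patient, study, image = PurePosixPath(path).parts[-4:]
    if study.endswith("_positive"):
        label = 1
    elif study.endswith("_negative"):
        label = 0
    else:
        label = None  # test-set folders are named study1, study2, ...
    return ViewInfo(study_type, patient, study, label, image)


# Made-up path that follows the layout described above.
info = parse_view_path("train/XR_WRIST/patient12104/study3_positive/image1.png")
print(info)
```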

Read my full EDA report here.

Building the data pipeline:

According to MURA paper:

The model takes as input one or more views for a study of an upper extremity. On each view, our 169-layer convolutional neural network predicts the probability of abnormality. We compute the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image. The model makes the binary prediction of abnormal if the probability of abnormality for the study is greater than 0.5.
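The aggregation rule in this quote is simple enough to write down directly. The sketch below assumes the per-view probabilities have already been produced by the network; predict_study is a hypothetical helper name.

```python
from statistics import mean
from typing import Sequence, Tuple


def predict_study(view_probs: Sequence[float], threshold: float = 0.5) -> Tuple[float, int]:
    """Average per-view abnormality probabilities and threshold the result."""
    if not view_probs:
        raise ValueError("a study needs at least one view")
    study_prob = mean(view_probs)
    # The paper predicts abnormal only when the mean is strictly greater than 0.5.
    return study_prob, int(study_prob > threshold)


# Made-up probabilities for a study with three views.
prob, label = predict_study([0.9, 0.4, 0.6])
print(prob, label)
```

Note that a single confident view does not decide the outcome on its own: the views vote through their average.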

So we need to predict the probability of abnormality at the study level. If you read the EDA report, you know that each study may have one or more views (images). We therefore need a study-level data pipeline, one that returns all images of a study, to be fed to the model, together with the label of that study. Let’s have a look at the required data augmentation, according to the paper:

Before feeding images into the network, we normalized each image to have the same mean and standard deviation of images in the ImageNet training set. We then scaled the variable-sized images to 224×224. We augmented the data during training by applying random lateral inversions and rotations.

Thankfully, PyTorch provides the easy-to-use Dataset and DataLoader classes for building data pipelines, and torchvision’s transforms module covers data augmentation.

We will be using only the wrist study data for now. Check get_study_level_data to know more.

Data pipeline for our model

ImageDataSet prepares the dataset; its __getitem__ method is called every time we iterate over the pipeline. It takes a study, stacks all of its images into a tensor, and returns it in a dict together with the label of the corresponding study. To implement data augmentation, we use the transforms module (torchvision.transforms). We resize each image to 224x224, apply random horizontal flips, rotate the image by a random angle of up to 10 degrees, convert it to a tensor, and then normalize it with the mean and standard deviation of the ImageNet dataset. get_dataloaders returns the dataloaders for the train and valid sets in a dict.

Building the model:

We used a 169-layer convolutional neural network to predict the probability of abnormality for each image in a study. The network uses a Dense Convolutional Network architecture — detailed in Huang et al. (2016) — which connects each layer to every other layer in a feed-forward fashion to make the optimization of deep networks tractable. We replaced the final fully connected layer with one that has a single output, after which we applied a sigmoid nonlinearity.

The weights of the network were initialized with weights from a model pretrained on ImageNet (Deng et al., 2009).

PyTorch ships a DenseNet implementation (in torchvision), but to replace the final fully connected layer with one that has a single output, and to initialize the model with weights pretrained on ImageNet, we need to modify the default implementation. The modified DenseNet (a 169-layer dense CNN) can be found here.

The Loss function:

For each image X of study type T in the training set, we optimized the weighted binary cross entropy loss

$$L(X, y) = -w_{T,1} \cdot y \log p(Y = 1 \mid X) - w_{T,0} \cdot (1 - y) \log p(Y = 0 \mid X),$$

where $y$ is the label of the study, $p(Y = i \mid X)$ is the probability that the network assigns to the label $i$, $w_{T,1} = |N_T| / (|A_T| + |N_T|)$, and $w_{T,0} = |A_T| / (|A_T| + |N_T|)$, where $|A_T|$ and $|N_T|$ are the number of abnormal images and normal images of study type $T$ in the training set, respectively.

We can create a custom loss class for our model by subclassing PyTorch’s torch.nn.Module.
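A minimal sketch of such a loss class, following the formula from the paper, might look like this. WeightedBCELoss is an assumed name, the eps clamp is an addition for numerical safety, and the counts in the usage lines are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import torch
import torch.nn as nn


class WeightedBCELoss(nn.Module):
    """Weighted binary cross entropy with class-balancing weights."""

    def __init__(self, num_abnormal: int, num_normal: int, eps: float = 1e-7):
        super().__init__()
        total = num_abnormal + num_normal
        self.w1 = num_normal / total    # weight on the abnormal (y = 1) term
        self.w0 = num_abnormal / total  # weight on the normal (y = 0) term
        self.eps = eps                  # keeps log() away from 0

    def forward(self, prob: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = prob.clamp(self.eps, 1 - self.eps)
        loss = -(self.w1 * target * torch.log(prob)
                 + self.w0 * (1 - target) * torch.log(1 - prob))
        return loss.mean()


# Hypothetical counts, not the real MURA statistics.
criterion = WeightedBCELoss(num_abnormal=40, num_normal=60)
loss = criterion(torch.tensor([0.8]), torch.tensor([1.0]))
print(loss.item())  # equals 0.6 * -log(0.8)
```

The rarer class gets the larger weight, so mistakes on it cost more and the class imbalance within a study type is partly compensated.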

Training the model:

Training the model is pretty straightforward: call next on the dataloader and it returns a dict with all images of a study and the corresponding label. Feed all the images at once, i.e., in a vectorized way, predict the abnormality probability for each image, take the mean of all the predictions, calculate the loss, optimize, and repeat.
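One iteration of that loop could be sketched as below. To keep the example fast and self-contained, it uses a tiny stand-in model and PyTorch's plain nn.BCELoss; in the post these roles are played by the DenseNet-169 and the weighted loss. train_on_study and the dict keys are assumed names.

```python
import torch
import torch.nn as nn


def train_on_study(model, optimizer, criterion, images, label):
    """One optimization step on a single study.

    images: tensor of shape (num_views, 3, H, W); label: tensor of shape (1,).
    """
    model.train()
    optimizer.zero_grad()
    view_probs = model(images)            # shape (num_views, 1): one probability per view
    study_prob = view_probs.mean(dim=0)   # shape (1,): mean over the study's views
    loss = criterion(study_prob, label)
    loss.backward()
    optimizer.step()
    return loss.item()


# Stand-in model so the sketch runs quickly; the post uses DenseNet-169 instead.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-4)

# Made-up study with three 8x8 "views" and an abnormal label.
study = {"images": torch.randn(3, 3, 8, 8), "label": torch.tensor([1.0])}
loss_value = train_on_study(toy_model, optimizer, nn.BCELoss(), study["images"], study["label"])
print(loss_value)
```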

The network was trained end-to-end using Adam with default parameters β1 = 0.9 and β2 = 0.999 (Kingma & Ba, 2014). We trained the model using minibatches of size 8. We used an initial learning rate of 0.0001 that is decayed by a factor of 10 each time the validation loss plateaus after an epoch, and chose the model with the lowest validation loss.

PyTorch provides ReduceLROnPlateau, with which we can monitor the validation loss and decay the learning rate by a factor of 10 each time the validation loss plateaus.
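Here is a small sketch of how the scheduler reacts to a plateau. The stand-in model, the validation losses and patience=1 are all assumptions made for the demonstration; only Adam, the initial learning rate of 0.0001 and the factor of 10 come from the paper.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(4, 1)  # stand-in for the DenseNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# factor=0.1 divides the learning rate by 10; patience=1 is an assumed value.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=1)

# Made-up validation losses that stop improving after the second epoch.
fake_val_losses = [0.70, 0.65, 0.65, 0.65, 0.65]
lrs = []
for val_loss in fake_val_losses:
    # ... train for one epoch, then compute the validation loss ...
    scheduler.step(val_loss)
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

With patience=1 the scheduler tolerates one epoch without improvement and cuts the learning rate on the second, so the drop shows up at the fourth epoch in this run.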

The training function is implemented here.

I have an NVIDIA Tesla K80, and one epoch on the wrist study type takes approximately 30 minutes.

The whole code is open sourced and can be found here:

I’ve tried to give a basic overview of my code; let me know in the comments section if you have any questions.

Hit 👏 if you liked the post. Enjoy :)
