The math behind Backprop gradients which nobody talks about

Rishabh Agrahari
Jun 10, 2019

People don’t really go in depth when it comes to calculating gradients during backpropagation using vector calculus. Usually, there are shortcuts that get the job done at the expense of an in-depth conceptual understanding.

Suppose we have a simple neural network, the equations of which look like:

z = Wx + b
a = σ(z)
C = C(a, y)

Where (for each variable: name, then shape):

x: input vector, m x 1

W: weight matrix, n x m

b: bias vector, n x 1

z: pre-activation vector, n x 1

σ: activation function, say, the sigmoid

a: activation vector, n x 1

y: ground truth label, n x 1

C: cost function, 1 x 1
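
As a concrete sketch of these shapes in NumPy (the sizes n and m, the sigmoid activation, and the squared-error cost below are my own illustrative assumptions; the post keeps them generic), the forward pass might look like:

```python
import numpy as np

n, m = 3, 4                        # assumed sizes, purely for illustration

x = np.random.randn(m, 1)          # input vector, m x 1
W = np.random.randn(n, m)          # weight matrix, n x m
b = np.random.randn(n, 1)          # bias vector, n x 1
y = np.random.randn(n, 1)          # ground-truth label, n x 1 (random values, shapes only)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z = W @ x + b                      # pre-activation, n x 1
a = sigmoid(z)                     # activation, n x 1
C = 0.5 * np.sum((a - y) ** 2)     # assumed squared-error cost, a scalar
```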

It’s a fairly simple model. By the chain rule, here’s the expression for grad(W):

dC/dW = dC/da · da/dz · dz/dW

The shape of grad(W) is going to be n x m, the same as W.

Most of the tutorials online use tricks and shortcuts to find individual elements of the above expression; here we are going to use standard operations of vector calculus to derive grad(W).

C is a function mapping the n x 1 vector a to a scalar value, so dC/da is going to be a Jacobian of shape 1 x n, i.e. the row vector [∂C/∂a1, ∂C/∂a2, …, ∂C/∂an].
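
With the squared-error cost assumed in the sketch above, C = 0.5 · Σ (ai − yi)², each entry is ∂C/∂ai = (ai − yi), so dC/da = (a − y)ᵀ. That cost is only an illustrative choice; the derivation itself works for any differentiable C.

```python
# dC/da as a 1 x n row vector (squared-error cost assumed for illustration)
dCda = (a - y).T
```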

Coming to da/dz: both a and z are vectors of shape n x 1, so it’s going to be a Jacobian matrix of shape n x n whose (i, j)-th entry is ∂ai/∂zj:
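
Because the activation acts element-wise (ai = σ(zi)), ∂ai/∂zj is zero whenever i ≠ j, so this Jacobian is diagonal with σ′(zi) on the diagonal. For the sigmoid in particular, σ′(zi) = ai(1 − ai), so in the running sketch:

```python
# n x n diagonal Jacobian: a_i = sigmoid(z_i) depends only on z_i,
# and sigmoid'(z_i) = a_i * (1 - a_i)
dadz = np.diag((a * (1 - a))[:, 0])
```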

Coming to dz/dW: z is a vector of shape n x 1 and W is a matrix of shape n x m, so it’s going to be a tensor of shape n x (n x m).

We know that z = Wx + b, i.e. zi = Wi1·x1 + Wi2·x2 + … + Wim·xm + bi.

The above tensor contains n matrices of n x m shape, each of which is obtained by differentiating the corresponding component of z w.r.t. each element of W. Since zi depends only on the i-th row of W, the i-th matrix has xᵀ in its i-th row and zeros everywhere else.
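
A small NumPy sketch (still in the assumed setup above) makes this 3D array explicit:

```python
# dz/dW as an n x (n x m) tensor: dzdW[i] is the Jacobian of z_i w.r.t. W.
# Since z_i = sum_j W_ij * x_j + b_i, we have dz_i/dW_ij = x_j, and z_i does
# not depend on any other row of W, so each slice is zero except for row i.
dzdW = np.zeros((n, n, m))
for i in range(n):
    dzdW[i, i, :] = x[:, 0]
```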

So, now we have:

dC/da of shape 1 x n

da/dz of shape n x n

dz/dW of shape n x (n x m)
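
These are exactly the shapes of the arrays built in the running sketch, which we can check directly:

```python
assert dCda.shape == (1, n)
assert dadz.shape == (n, n)
assert dzdW.shape == (n, n, m)
```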

We have to multiply the three gradients above in order to get grad(W). Notice that da/dz and dz/dW are 2D and 3D respectively; how do we multiply a 2D matrix by a 3D tensor?

The answer lies in vector space isomorphism: the space of n x m matrices is isomorphic to the space of nm-dimensional vectors, so we can flatten a matrix into a vector (and unflatten it back) without losing any information.

Therefore, we can reshape our n x (n x m) tensor into an n x nm matrix, and once we are done with the multiplication we reshape the resulting 1 x nm row vector back to n x m:

dC/da * da/dz * dz/dW = (1 x n) * (n x n) * (n x nm)

=> 1 x nm => (reshape) => n x m
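
Putting the whole thing together in the assumed NumPy setup, and sanity-checking the result against the familiar shortcut formula grad(W) = ((a − y) ⊙ σ′(z)) xᵀ, which holds for the squared-error cost assumed here:

```python
dzdW_2d = dzdW.reshape(n, n * m)                  # n x (n x m)  ->  n x nm

grad_W = (dCda @ dadz @ dzdW_2d).reshape(n, m)    # 1 x nm  ->  n x m

# The usual "shortcut" result for a single sigmoid layer with squared error
shortcut = ((a - y) * a * (1 - a)) @ x.T
assert np.allclose(grad_W, shortcut)
```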

This is a fairly simple example, but we can extend the idea to more complicated cases. Here’s an awesome book, Mathematics for Machine Learning, give it a try: https://mml-book.github.io/
