“What is the gradient of numpy broadcasting?” I was wondering the other night. It was almost midnight, my brain was tired, and I got a bit stuck.

In short…

In numpy, broadcasting is adding a one-dimensional vector b2 to a 2-D matrix X, which implicitly adds b2 to every row of X.

To compute the gradient of the loss $L$ with respect to b2 (let’s call it g), we rewrite the expression as a matrix multiplication. Here, B2 is a matrix of shape (M, 1), which is just b2 with one extra axis, $1^{B\times 1}$ is the broadcasting matrix, and B is the batch size:

$$X+b_2 = X + 1^{B\times 1}B_2^\top$$

In order to backpropagate the upstream gradient $G$, it gets multiplied by the transpose of the broadcasting matrix:

$$g^\top = \left(\frac{\partial L}{\partial b_2}\right)^\top = 1^{1\times B} G$$

Effectively, $1^{1\times B}$ acts as a summation along the first axis ($g_k = \sum_{i=1}^B G_{i, k}$), which can be expressed as G.sum(axis=0) in numpy.

Let’s explore how to get there in three different ways: the intuitive one, the “compact form”, and the “full form” of tensor derivatives. But first, how did this even come about?

Midnight homework

Back then, I had just started my CS231n course studies as part of the Stanford SCPD certificate program. CS231n “Convolutional Neural Networks for Visual Recognition” is a great “starter” course if you’re studying Deep Learning. We had to learn the basics and to manually implement and test quite a few backpropagation algorithms for simple network layers. I’ll write a more in-depth overview later.

Importantly, we got to understand vector operations at a really low level, and we had to use the general-purpose math library numpy as opposed to the Deep Learning-specific TensorFlow or PyTorch.

Broadcasting in a Fully-Connected Layer

When I had just started CS231n, I was used to that framework-level thinking about deep learning algorithms, and it took a bit of effort to shake it off. When implementing a two-layer fully-connected neural network using numpy, you add the bias vector to the input batch multiplied by the weight matrix:
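```python
h2 = X1.dot(W2) + b2
```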

Here:

  • X1 is a matrix of shape (B, N), where B is the batch size and N is the number of outputs from the first hidden layer;
  • W2 is the weight matrix of shape (N, M), where M is the number of outputs from the second hidden layer;
  • b2 is the bias vector of shape (M);
  • h2, of shape (B, M), is the output of the layer for each example in the batch.

But wait. The result of the matrix multiplication X1.dot(W2) is of shape (B, M), so how can you add a vector of shape (M) to it? In numpy, as well as in our lecture notes, that’s called “broadcasting”, and it’s a convenient method of working with batches of data. The code above essentially means “add b2 to every row of X1.dot(W2)”. Our lecture notes told us to “use broadcasting”, and I did. Neat.
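As a quick illustration (with made-up shapes), adding a vector of shape (M) to a matrix of shape (B, M) gives the same result as explicitly repeating the vector into every row first:

```python
import numpy as np

B, M = 4, 3
rng = np.random.default_rng(0)
S = rng.standard_normal((B, M))   # stands in for X1.dot(W2)
b2 = rng.standard_normal(M)       # shape (M)

h2 = S + b2   # broadcasting: b2 is added to every row of S

# Same result as explicitly repeating b2 into B rows
assert np.allclose(h2, S + np.tile(b2, (B, 1)))
```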

Later, I was implementing backprop, and for that I needed to know the derivative $\frac{\partial h_2}{\partial b_2}$. That’s when I got confused: what is the gradient of that “broadcasting”?

Intuitive derivation

We can ask a simpler question instead: how does $b_2$ contribute to the eventual loss $L$? Since it independently affects each example in the batch (which means its impact across the batch needs to be summed), we could simply compute the gradient for each example and then sum the results along the batch axis, as sketched below. That seemed too inventive to me at the time. Is there a more “mechanical” way to approach the task?
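Here is that intuitive version as a numpy sketch (shapes are made up, and G stands for the upstream gradient $\frac{\partial L}{\partial h_2}$):

```python
import numpy as np

B, M = 4, 3
rng = np.random.default_rng(0)
G = rng.standard_normal((B, M))   # upstream gradient dL/dh2, one row per example

# Each example i contributes its own row G[i] to dL/db2;
# summing the per-example contributions gives the gradient for b2.
db2 = sum(G[i] for i in range(B))

assert np.allclose(db2, G.sum(axis=0))   # i.e. a sum along the batch axis
```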

Broadcasting as matrix multiplication

To answer that, let’s think about how to represent broadcasting without knowing that we operate in batches. Apparently, some function f is applied to that b2 vector before it’s added to the product $X_1 W_2$:

$$h_2 = X_1 W_2 + f(b_2)$$

In order for the addition to make sense, f should produce a matrix of shape (B, M) out of a vector of shape (M), such that each row of $f(b_2)$ equals $b_2$. Could it be matrix multiplication?

Indeed, if $f(b_2) = Z B_2^\top$, then Z is a matrix of shape (B, 1), and we can consider $b_2$ a matrix of shape (M, 1) (let’s actually start calling it $B_2$, since a one-column matrix and a vector are two different things). It makes sense if Z is simply a column of ones (as in np.ones((B, 1))):

$$f(b_2) = 1^{B \times 1} B_2^\top$$

So, the expression for $h_2$ becomes something more manageable:

$$h_2 = X_1 W_2 + 1^{B\times 1}B_2^\top$$
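A quick numpy check (shapes made up for the sketch) that this matrix-multiplication formulation really matches plain broadcasting:

```python
import numpy as np

B, N, M = 4, 5, 3
rng = np.random.default_rng(0)
X1 = rng.standard_normal((B, N))
W2 = rng.standard_normal((N, M))
b2 = rng.standard_normal(M)

B2 = b2.reshape(M, 1)                                # b2 as a one-column matrix
h2_broadcast = X1.dot(W2) + b2                       # numpy broadcasting
h2_matmul = X1.dot(W2) + np.ones((B, 1)).dot(B2.T)   # 1^{B x 1} B2^T

assert np.allclose(h2_broadcast, h2_matmul)
```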

My first instinct was to try to define $\frac{\partial h_2}{\partial B_2}$. As I later learned, that’s not the most effective way: it only makes sense when operating on the “full form” derivatives I’ll describe below. Instead of directly computing the “gradient of broadcasting”, we use it in backpropagation from the upstream gradient $G$:

$$\frac{\partial L}{\partial b_2} = \left(\frac{\partial L}{\partial B_2^\top}\right)^\top = \left(\underbrace{\frac{\partial L}{\partial h_2}}_{= G^\top} \frac{\partial h_2}{\partial B_2^\top}\right)^\top = \left(\frac{\partial f(B_2^\top)}{\partial B_2^\top}\right)^\top G = \left(1^{B \times 1}\right)^\top G = 1^{1 \times B}G$$

This is essentially the gradient of the broadcasting applied via the chain rule.
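To double-check the result, here’s a numerical gradient check on a toy scalar loss (everything here, including the extra weights w3, is made up purely for the sketch):

```python
import numpy as np

B, N, M = 4, 5, 3
rng = np.random.default_rng(0)
X1 = rng.standard_normal((B, N))
W2 = rng.standard_normal((N, M))
b2 = rng.standard_normal(M)
w3 = rng.standard_normal(M)   # arbitrary weights that reduce h2 to a scalar loss

def loss(b):
    h2 = X1.dot(W2) + b       # broadcasting
    return h2.dot(w3).sum()   # toy scalar loss L

# Upstream gradient G = dL/dh2 for this toy loss: every row equals w3
G = np.tile(w3, (B, 1))

# Analytic gradient from the derivation above: dL/db2 = 1^{1 x B} G
db2 = np.ones((1, B)).dot(G).ravel()

# Numerical gradient, one coordinate of b2 at a time
eps = 1e-6
db2_num = np.array([(loss(b2 + eps * np.eye(M)[k]) - loss(b2 - eps * np.eye(M)[k])) / (2 * eps)
                    for k in range(M)])

assert np.allclose(db2, db2_num, atol=1e-5)
```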

What does the expression $1^{1 \times B}G$ mean? If you look at its per-component layout, it effectively calculates the sum of the right-hand-side matrix along the batch axis. Yep, the transposed “broadcasting” matrix becomes a summation matrix, and the end result is:

$$g_k = \sum_{i=1}^{B}G_{i,k}$$
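In numpy, both sides of this equation are one-liners (shapes made up for the sketch):

```python
import numpy as np

B, M = 4, 3
rng = np.random.default_rng(0)
G = rng.standard_normal((B, M))

assert np.allclose(np.ones((1, B)).dot(G).ravel(),  # 1^{1 x B} G
                   G.sum(axis=0))                   # g_k = sum_i G[i, k]
```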

Compact and Full forms of tensor derivatives

Wait, why is $\frac{\partial L}{\partial h_2}=G^\top$ rather than $G$? And why did we end up with a matrix of shape (1, M) instead of the desired (M, 1)? Where do all these transpositions come from? Well, that’s because we’ve actually been using the “compact form” instead of the “full form” of tensor derivatives.

Yep, the way we usually operate with tensor derivatives is by using the “compact form”, which omits quite a few zeros. You can read a great overview of this here in the CS231n lecture notes.

The actual “full” Jacobian of two matrices $\frac{\partial \mathbf{A}}{\partial \mathbf{B}}$ would be a four-dimensional array of shape $(a_1, a_2, b_2, b_1)$ (where $A \sim (a_1, a_2)$ and $B \sim (b_1, b_2)$; we’re using $\sim$ for “is of shape” here). However, in deep neural network calculus, one of these dimensions is usually the batch dimension, and batches are independent from one another, so most of the elements in the “full Jacobian” would be 0 (read more here). That’s why everyone uses a compact notation that only lists the elements that matter. This requires you to be inventive and “get it”. But for me to “get it”, I had to write it out in the full form:

$$\frac{\partial h_2}{\partial b_2} = 1^{B} \otimes I^{M\times M} \sim (B, M, M)$$

You can see how most elements of this tensor (let’s call it $T$, to avoid clashing with the dimension $M$) are indeed zeros: $T_{i, j, k}=1$ when $j=k$, and $0$ otherwise. In fact, it only has $B \cdot M$ nonzero elements out of $B \cdot M^2$. This is consistent with the convention that the “compact form” of $\frac{\partial \cdot}{\partial \mathbf{X}}$ is usually assumed to take the shape of $\mathbf{X}$ itself. Compare the “full form”:

$$G_{full} = \left[\frac{\partial L}{\partial h_2}\right]_{full} \sim (1, M, B) $$

with the “compact form” we’re used to:

$$G = \left[\frac{\partial L}{\partial h_2}\right]_{compact} \sim (B, M) $$

Except for that extra axis, $G = G^\top_{full}$; that’s where the transposition comes from.

Overall, one can indeed operate on the “full” derivatives in a more algorithmic way, but sadly that would increase the number of dimensions and require far more computation than the compact form.
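Even without the full symbolic treatment, it’s easy to check numerically that the full-form Jacobian behaves as claimed (a sketch with made-up shapes, using the names $T$ and $G_{full}$ from above):

```python
import numpy as np

B, M = 4, 3
rng = np.random.default_rng(0)
G = rng.standard_normal((B, M))   # compact form of dL/dh2, shape (B, M)
G_full = G.T[np.newaxis, :, :]    # full form, shape (1, M, B)

# Full-form Jacobian dh2/db2: T[i, j, k] = 1 iff j == k, i.e. 1^B (x) I^{MxM}
T = np.broadcast_to(np.eye(M), (B, M, M))

# Contract over the batch and output axes: g_k = sum_{i,j} G[i, j] * T[i, j, k]
g = np.tensordot(G, T, axes=([0, 1], [0, 1]))

assert G_full.shape == (1, M, B)
assert np.allclose(G, G_full[0].T)    # compact vs full: a transpose plus an extra axis
assert np.allclose(g, G.sum(axis=0))  # same answer as the compact-form derivation
```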

I need to learn a bit more about how to represent derivatives with tensors and their products before trying to write out the full form for the expression above. I’ll update this post when I do: taking outer products of tensors with derivatives isn’t straightforward.

Perhaps there’s a way to both keep it “mechanised” and also simplify these expressions; let me know in the comments!