MSE with ReLU using OPE Prop
The ReLU function is one of the best-known and most widely used activation functions, and paired with MSE it is a natural fit for regression and linear models. This page walks through the OPE Prop formulas for the MSE and ReLU pair. Let's begin!
Here are the final formulas:
$$w_k = \frac{\sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) z(i_r)}{\sum_{r=1}^{n} (i_{rk})^2\, z(i_r)}$$

$$b = \frac{\sum_{r=1}^{n} \left( y_r - \sum_{g=1}^{n_w} w_g i_{rg} \right) z(i_r)}{\sum_{r=1}^{n} z(i_r)}$$
Where $z$, for simplicity, is:

$$z(x) = 1 + \operatorname{sign}(p(x))$$
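To make these concrete, here is a minimal NumPy sketch of one coordinate-wise pass. The function name, variable names, and the choice to recompute $z$ from the current parameters before each update are my own assumptions for illustration; only the update formulas themselves come from above.

```python
import numpy as np

def ope_prop_mse_relu_pass(X, y, w, b):
    """One coordinate-wise pass of the MSE + ReLU OPE Prop updates (sketch).

    X -- (n, nw) matrix of inputs (the 'i' above)
    y -- (n,) vector of answers
    w -- (nw,) float vector of weights (updated in place)
    b -- bias (returned updated)
    """
    n, nw = X.shape
    for k in range(nw):
        p = X @ w + b                        # predictions p(i_r) with current parameters
        z = 1 + np.sign(p)                   # z(i_r) = 1 + sign(p(i_r)), the row mask
        # y_r minus every term of p except the w_k one: y - sum_{g != k} w_g i_rg - b
        resid = y - (X @ w - w[k] * X[:, k]) - b
        denom = np.sum(X[:, k] ** 2 * z)
        if denom != 0:                       # every row masked out: leave w_k unchanged
            w[k] = np.sum(X[:, k] * resid * z) / denom
    p = X @ w + b
    z = 1 + np.sign(p)
    denom = np.sum(z)
    if denom != 0:
        # per the formula, the numerator subtracts the weighted sum but not b
        b = np.sum((y - X @ w) * z) / denom
    return w, b
```

Note that $z$ depends on $p$, which depends on the very parameters being solved for, so in practice a pass like this would presumably be repeated until the values settle.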
Unpacking the Math
Having the final formulas right at the start is handy if you only need the result, or if you are just not interested in the full details. This next section delves into how the formulas for MSE and ReLU were derived, so if you want, I encourage you to dive in!
Firstly, if we assume that:
$i$ - is the matrix of all input entries, where the first dimension is the data row, and the second is the individual input
$y$ - is a vector of answers
$n$ - is the length of the dataset
$w$ - is a vector of weights
$n_w$ - is the number of weights
$b$ - is the bias
And represent the ReLU function as:
$$a(x) = \frac{x + |x|}{2}$$
The derivative of which is:
$$a'(x) = \frac{1 + \operatorname{sign}(x)}{2}$$
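As a quick sanity check (my own snippet, not part of the derivation), this algebraic form agrees with the usual $\max(0, x)$ definition of ReLU:

```python
import numpy as np

x = np.linspace(-3, 3, 7)
a = (x + np.abs(x)) / 2                  # a(x) = (x + |x|) / 2
assert np.allclose(a, np.maximum(0, x))  # matches the usual ReLU
da = (1 + np.sign(x)) / 2                # a'(x) = (1 + sign(x)) / 2
# np.sign(0) is 0, so da is 1/2 at x = 0; ReLU has no derivative there,
# so any value in [0, 1] is a valid subgradient choice.
```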
Then using the OPE Prop formula for MSE and any activation function, which is:
$$\sum_{r=1}^{n} \left( a(p(i_r)) - y_r \right) \cdot a'(p(i_r)) \cdot p'(i_r) = 0$$
And substituting ReLU in for $a$, we get:

$$\sum_{r=1}^{n} \left( \frac{p(i_r) + |p(i_r)|}{2} - y_r \right) \cdot \frac{1 + \operatorname{sign}(p(i_r))}{2} \cdot p'(i_r) = 0$$
Which simplifies to:

$$\sum_{r=1}^{n} \left( p(i_r)\, p'(i_r) - y_r\, p'(i_r) \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
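For clarity, the simplification rests on two identities (the edge case $p(i_r) = 0$ is glossed over):

$$\frac{x + |x|}{2} = x \cdot \frac{1 + \operatorname{sign}(x)}{2}, \qquad \left( \frac{1 + \operatorname{sign}(x)}{2} \right)^2 = \frac{1 + \operatorname{sign}(x)}{2} \quad (x \neq 0)$$

Applying both and then dropping the overall factor of $\tfrac{1}{2}$ (harmless, since the sum is set to zero) yields the form above.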
Remembering that:
$$p(x) = \sum_{g=1}^{n_w} w_g x_g + b$$
We can find $p'$ with respect to both $w_k$ and $b$:

$$\frac{\partial p}{\partial w_k} = x_k$$

$$\frac{\partial p}{\partial b} = 1$$
And then substitute them into the equality:

For $w_k$:

$$\sum_{r=1}^{n} \left( p(i_r)\, i_{rk} - y_r\, i_{rk} \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
For $b$:

$$\sum_{r=1}^{n} \left( p(i_r) - y_r \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
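To see where the closed forms come from, expand $p(i_r) = \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} + w_k i_{rk} + b$ in the $w_k$ equality and gather the $w_k$ terms (the same steps apply to $b$):

$$w_k \sum_{r=1}^{n} (i_{rk})^2 \left( 1 + \operatorname{sign}(p(i_r)) \right) = \sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) \left( 1 + \operatorname{sign}(p(i_r)) \right)$$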
Solving both of the equalities for $w_k$ and $b$, we get:

$$w_k = \frac{\sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) z(i_r)}{\sum_{r=1}^{n} (i_{rk})^2\, z(i_r)}$$

$$b = \frac{\sum_{r=1}^{n} \left( y_r - \sum_{g=1}^{n_w} w_g i_{rg} \right) z(i_r)}{\sum_{r=1}^{n} z(i_r)}$$
Where $z$, for simplicity, is:

$$z(x) = 1 + \operatorname{sign}(p(x))$$
As for the function $z$: it simply determines whether a given data row takes part in the calculation. When the sign of the prediction is -1, $z$ becomes 0, and multiplying by it removes that data row from the calculation. This makes sense when we consider that the ReLU function turns everything below zero into zero.
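A tiny illustration of that masking behaviour, with made-up prediction values:

```python
import numpy as np

# Made-up predictions p(i_r) for four data rows
p_vals = np.array([2.3, -0.7, 0.0, 1.1])
z = 1 + np.sign(p_vals)
print(z)  # [2. 0. 1. 2.] -- the row with a negative prediction drops out of every sum
```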
If you are interested, my full notes on deriving these formulas are available here
OPE Prop formulas on this website are licensed under the CC BY-SA 4.0 License. More details here