MSE with ReLU using OPE Prop
The ReLU function is one of the best-known and most widely used activation functions, and paired with MSE it is a natural fit for regression and linear models. This page walks through the OPE Prop formulas for the MSE and ReLU pair. Let's begin!
Here are the final formulas:
$$w_k = \frac{\sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) z(i_r)}{\sum_{r=1}^{n} (i_{rk})^2\, z(i_r)}$$

$$b = \frac{\sum_{r=1}^{n} \left( y_r - \sum_{g=1}^{n_w} w_g i_{rg} \right) z(i_r)}{\sum_{r=1}^{n} z(i_r)}$$
Where $z$, for simplicity, is:

$$z(x) = 1 + \operatorname{sign}(p(x))$$
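To make these concrete, here is a minimal NumPy sketch of one coordinate-wise pass. The function name, variable names, and the choice to recompute $z$ from the current parameters before each update are my own assumptions for illustration; only the update formulas themselves come from above.

```python
import numpy as np

def ope_prop_mse_relu_pass(X, y, w, b):
    """One coordinate-wise pass of the MSE + ReLU OPE Prop updates (sketch).

    X -- (n, nw) matrix of inputs (the 'i' above)
    y -- (n,) vector of answers
    w -- (nw,) float vector of weights (updated in place)
    b -- bias (returned updated)
    """
    n, nw = X.shape
    for k in range(nw):
        p = X @ w + b                        # predictions p(i_r) with current parameters
        z = 1 + np.sign(p)                   # z(i_r) = 1 + sign(p(i_r)), the row mask
        # y_r minus every term of p except the w_k one: y - sum_{g != k} w_g i_rg - b
        resid = y - (X @ w - w[k] * X[:, k]) - b
        denom = np.sum(X[:, k] ** 2 * z)
        if denom != 0:                       # every row masked out: leave w_k unchanged
            w[k] = np.sum(X[:, k] * resid * z) / denom
    p = X @ w + b
    z = 1 + np.sign(p)
    denom = np.sum(z)
    if denom != 0:
        # per the formula, the numerator subtracts the weighted sum but not b
        b = np.sum((y - X @ w) * z) / denom
    return w, b
```

Note that $z$ depends on $p$, which depends on the very parameters being solved for, so in practice a pass like this would presumably be repeated until the values settle.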
Unpacking the Math
Having the final formulas right at the start is handy if you only need the result, or if you are just not interested in the full details. This next section delves into how the formulas for MSE and ReLU were derived, so if you want, I encourage you to dive in!
Firstly, if we assume that:
$i$ - is the matrix of all input entries, where the first dimension is the data row, and the second is the individual input
$y$ - is a vector of answers
$n$ - is the length of the dataset
$w$ - is a vector of weights
$n_w$ - is the number of weights
$b$ - is the bias
And represent the ReLU function as:
$$a(x) = \frac{x + |x|}{2}$$
The derivative of which is:
$$a'(x) = \frac{1 + \operatorname{sign}(x)}{2}$$
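As a quick sanity check (my own snippet, not part of the derivation), this algebraic form agrees with the usual $\max(0, x)$ definition of ReLU:

```python
import numpy as np

x = np.linspace(-3, 3, 7)
a = (x + np.abs(x)) / 2                  # a(x) = (x + |x|) / 2
assert np.allclose(a, np.maximum(0, x))  # matches the usual ReLU
da = (1 + np.sign(x)) / 2                # a'(x) = (1 + sign(x)) / 2
# np.sign(0) is 0, so da is 1/2 at x = 0; ReLU has no derivative there,
# so any value in [0, 1] is a valid subgradient choice.
```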
Then using the OPE Prop formula for MSE and any activation function, which is:
$$\sum_{r=1}^{n} \left( a(p(i_r)) - y_r \right) \cdot a'(p(i_r)) \cdot p'(i_r) = 0$$
And substituting ReLU in for $a$, we get:

$$\sum_{r=1}^{n} \left( \frac{p(i_r) + |p(i_r)|}{2} - y_r \right) \cdot \frac{1 + \operatorname{sign}(p(i_r))}{2} \cdot p'(i_r) = 0$$
Which simplifies to:

$$\sum_{r=1}^{n} \left( p(i_r)\, p'(i_r) - y_r\, p'(i_r) \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
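For clarity, the simplification rests on two identities (the edge case $p(i_r) = 0$ is glossed over):

$$\frac{x + |x|}{2} = x \cdot \frac{1 + \operatorname{sign}(x)}{2}, \qquad \left( \frac{1 + \operatorname{sign}(x)}{2} \right)^2 = \frac{1 + \operatorname{sign}(x)}{2} \quad (x \neq 0)$$

Applying both and then dropping the overall factor of $\tfrac{1}{2}$ (harmless, since the sum is set to zero) yields the form above.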
Remembering that:
$$p(x) = \sum_{g=1}^{n_w} w_g x_g + b$$
We can find $p'$ with respect to both $w_k$ and $b$:

$$\frac{\partial p}{\partial w_k} = x_k$$

$$\frac{\partial p}{\partial b} = 1$$
And then substitute them into the equality:

For $w_k$:

$$\sum_{r=1}^{n} \left( p(i_r)\, i_{rk} - y_r\, i_{rk} \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
For $b$:

$$\sum_{r=1}^{n} \left( p(i_r) - y_r \right) \left( 1 + \operatorname{sign}(p(i_r)) \right) = 0$$
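To see where the closed forms come from, expand $p(i_r) = \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} + w_k i_{rk} + b$ in the $w_k$ equality and gather the $w_k$ terms (the same steps apply to $b$):

$$w_k \sum_{r=1}^{n} (i_{rk})^2 \left( 1 + \operatorname{sign}(p(i_r)) \right) = \sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) \left( 1 + \operatorname{sign}(p(i_r)) \right)$$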
Solving both of the equalities for $w_k$ and $b$, we get:

$$w_k = \frac{\sum_{r=1}^{n} i_{rk} \left( y_r - \sum_{g=1,\, g \neq k}^{n_w} w_g i_{rg} - b \right) z(i_r)}{\sum_{r=1}^{n} (i_{rk})^2\, z(i_r)}$$

$$b = \frac{\sum_{r=1}^{n} \left( y_r - \sum_{g=1}^{n_w} w_g i_{rg} \right) z(i_r)}{\sum_{r=1}^{n} z(i_r)}$$
Where $z$, for simplicity, is:

$$z(x) = 1 + \operatorname{sign}(p(x))$$
As for the function $z$: it simply determines whether a given data row takes part in the calculation. When the sign of the prediction is -1, $z$ becomes 0, and multiplying by it removes that data row from the calculation. This makes sense when we consider that the ReLU function turns everything below zero into zero.
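A tiny illustration of that masking behaviour, with made-up prediction values:

```python
import numpy as np

# Made-up predictions p(i_r) for four data rows
p_vals = np.array([2.3, -0.7, 0.0, 1.1])
z = 1 + np.sign(p_vals)
print(z)  # [2. 0. 1. 2.] -- the row with a negative prediction drops out of every sum
```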
If you are interested, my full notes on deriving these formulas are available here
OPE Prop formulas on this website are licensed under the CC BY-SA 4.0 License. More details here