How Backpropagation Works

Date: 2024-03-23 15:31:47

In order to minimize the cost function, we expect small changes in the weights to lead to small changes in the output. We can use this property to adjust the weights step by step, bringing the network closer to what we want. This is also one reason why we need a smooth activation function rather than a step function that outputs only {0, 1}.


1. The intuitive idea

$$\Delta \mathrm{Cost} \approx \sum_j \frac{\partial \mathrm{Cost}}{\partial w_j}\,\Delta w_j$$

$\Delta w_j$ should be small enough to ensure the approximation holds.

If we want to decrease the Cost, we need $\Delta \mathrm{Cost} < 0$. Therefore, $\Delta w_j$ should have the opposite sign of $\frac{\partial \mathrm{Cost}}{\partial w_j}$.

Here, let $\Delta w = -\eta \nabla_w \mathrm{Cost}$, where $\eta$ is the learning rate. Choosing $\eta = \epsilon / \|\nabla_w \mathrm{Cost}\|$ makes $\Delta \mathrm{Cost}$ as negative as possible in each update, subject to $\|\Delta w\| = \epsilon$. ($\|\Delta w\|$ is kept small to ensure the approximation holds.)
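
To make this concrete, here is a minimal NumPy sketch of a single gradient-descent update under these assumptions; the names (`gradient_descent_step`, `grad_cost`) are illustrative, not from any particular library:

```python
import numpy as np

def gradient_descent_step(w, grad_cost, eta=0.1):
    """One update w <- w - eta * dCost/dw, with eta the learning rate.

    `grad_cost` is a hypothetical callable returning the gradient of the
    cost with respect to the current weights w.
    """
    return w - eta * grad_cost(w)

# Toy usage: Cost(w) = ||w||^2 has gradient 2w, so each step shrinks w.
w = np.array([1.0, -2.0, 0.5])
w = gradient_descent_step(w, lambda w: 2 * w, eta=0.1)
```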




2. Backpropagation

Warm-up:

$\nabla_w \mathrm{Cost}$ (or $\frac{\partial \mathrm{Cost}}{\partial w}$) tells us how quickly the Cost changes when we update $w$.

2.1 Introduction to Notation

$w^l_{jk}$ : the weight of the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer.

$a^l_j$ : the activation (output) of the $j$th neuron in the $l$th layer.

$z^l_j$ : the weighted input to the $j$th neuron in the $l$th layer.

(Figure: a network diagram illustrating the weight and activation notation.)

$a^l_j = f\left(\sum_k w^l_{jk} a^{l-1}_k\right)$ can be vectorized as:

$$a^l = f(w^l a^{l-1})$$

($w^3$ in the picture above is a $2 \times 4$ matrix: the first row corresponds to the first neuron in layer 3 and the second row to the second neuron; $a^3$ is a $2 \times 1$ matrix, i.e. a vector.)

For convenience, we also define:

$$z^l = w^l a^{l-1}$$
$$a^l = f(z^l)$$
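
As a rough sketch of this vectorized forward pass, assuming the sigmoid as the activation $f$ and ignoring biases (as the text does); the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, a):
    """Compute z^l = w^l a^{l-1} and a^l = f(z^l) layer by layer.

    `weights` is a list of weight matrices [w^2, w^3, ...] and `a` is the
    input activation as a column vector. The z's and a's are stored
    because the backward pass later reuses them.
    """
    zs, activations = [], [a]
    for w in weights:
        z = w @ a
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Shapes as in the example above: w^3 is 2x4, a^2 is 4x1, so a^3 is 2x1.
w3 = np.random.randn(2, 4)
zs, activations = forward([w3], np.random.randn(4, 1))
```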



2.2 The Fundamental Equations

To begin, we define the error in a neuron:

$$\delta^l_j = \frac{\partial \mathrm{Cost}}{\partial z^l_j}$$

This definition makes sense. Suppose we add a small change $\Delta z^l_j$ to the weighted input $z^l_j$ of a neuron. Its output then becomes $f(z^l_j + \Delta z^l_j)$ instead of $f(z^l_j)$. The change propagates through the later layers, and the overall change in the cost is approximately $\frac{\partial \mathrm{Cost}}{\partial z^l_j} \Delta z^l_j$.

If $\frac{\partial \mathrm{Cost}}{\partial z^l_j}$ is close to zero, such a change cannot affect the cost much, and we can regard this neuron as being close to optimal. Heuristically, $\frac{\partial \mathrm{Cost}}{\partial z^l_j}$ is therefore a measure of the error in the neuron.

  1. An equation for the error in the output layer

    $$\delta^L = \frac{\partial \mathrm{Cost}}{\partial z^L}$$ (where $L$ denotes the last layer)

    $$\delta^L = \frac{\partial \mathrm{Cost}}{\partial a^L} \frac{\partial a^L}{\partial z^L} = \frac{\partial \mathrm{Cost}}{\partial a^L} \odot f'(z^L) = \nabla_{a^L} \mathrm{Cost} \odot f'(z^L)$$ ($\odot$ denotes the componentwise (Hadamard) product)


  2. An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:

$$\delta^l = \frac{\partial \mathrm{Cost}}{\partial z^l} = \frac{\partial \mathrm{Cost}}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial z^l} = \left((w^{l+1})^T \delta^{l+1}\right) \odot f'(z^l)$$

More specifically, for an individual neuron:

$$\delta^l_j = \frac{\partial \mathrm{Cost}}{\partial z^l_j} = \sum_k \frac{\partial \mathrm{Cost}}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$

Meanwhile,
$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j = \sum_j w^{l+1}_{kj} f(z^l_j),$$
so $\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} f'(z^l_j)$.

So
$$\delta^l_j = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k w^{l+1}_{kj} \delta^{l+1}_k f'(z^l_j)$$

Computing $(w^l)^T \delta^l$ to obtain $\delta^{l-1}$ can be thought of as propagating the error backward through the network. We then take the Hadamard product with $f'(z^{l-1})$, which moves the error backward through the activation function in layer $l-1$.


  3. An equation for the rate of change of the cost with respect to any weight in the network:
    $$\frac{\partial \mathrm{Cost}}{\partial w^l_{jk}} = \frac{\partial \mathrm{Cost}}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\, a^{l-1}_k$$

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k$$

In matrix form, with fewer indices:

$$\frac{\partial \mathrm{Cost}}{\partial w^l} = \delta^l (a^{l-1})^T, \quad \text{or schematically} \quad \frac{\partial \mathrm{Cost}}{\partial w} = a_{\mathrm{in}}\, \delta_{\mathrm{out}}$$

(A short code sketch of all three fundamental equations is given after the discussion below.)

(Figure: a weight $w$ with input activation $a_{\mathrm{in}}$ and output-neuron error $\delta_{\mathrm{out}}$.)

It is understood that $a_{\mathrm{in}}$ is the activation of the neuron feeding into the weight $w$, and $\delta_{\mathrm{out}}$ is the error of the neuron that the weight $w$ feeds into.

Note that if $a_{\mathrm{in}}$ is small, $\frac{\partial \mathrm{Cost}}{\partial w}$ is also small. We then say the weight learns slowly, meaning it does not change much during gradient descent. Recall from the graph of the sigmoid that $\sigma$ becomes very flat at both ends, so its derivative there is approximately 0. In this case it is common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly).
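
Putting the three fundamental equations together, here is a minimal NumPy sketch of the backward pass. It builds on the hypothetical `forward` and `sigmoid` helpers sketched earlier and assumes a quadratic cost, so that $\partial \mathrm{Cost}/\partial a^L = a^L - y$:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward(weights, zs, activations, y):
    """Return dCost/dw^l for every layer, via the three equations above.

    `zs` and `activations` come from the `forward` sketch; a quadratic
    cost is assumed, so dCost/da^L = a^L - y.
    """
    grads = [None] * len(weights)
    # Equation 1: delta^L = (dCost/da^L) ⊙ f'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # Equation 3: dCost/dw^L = delta^L (a^{L-1})^T
    grads[-1] = delta @ activations[-2].T
    # Equation 2: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ f'(z^l), moving backward
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads[l] = delta @ activations[l].T
    return grads
```

Each `grads[l]` has the same shape as `weights[l]`, so a gradient-descent step can subtract $\eta$ times it directly.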


Pseudo-code for updating $w$ ($m$ is the size of the mini-batch):

$$w^l = w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$$
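
Under the same assumptions, this update could be implemented on top of the `forward` and `backward` sketches like so (the batch is a list of `(x, y)` pairs; none of the names come from a specific library):

```python
import numpy as np

def update_mini_batch(weights, batch, eta):
    """Apply w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T."""
    m = len(batch)
    total = [np.zeros_like(w) for w in weights]
    for x, y in batch:
        zs, activations = forward(weights, x)          # one forward pass per example
        grads = backward(weights, zs, activations, y)  # one backward pass per example
        total = [t + g for t, g in zip(total, grads)]
    return [w - (eta / m) * t for w, t in zip(weights, total)]
```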

What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial \mathrm{Cost}/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network.