
Backpropagation

Backpropagation algorithms are a family of methods used to efficiently train artificial neural networks (ANNs) following a gradient descent approach that exploits the chain rule. The main feature of backpropagation is its iterative, recursive and efficient method for calculating the weight updates needed to improve the network until it can perform the task for which it is being trained. It is closely related to the Gauss–Newton algorithm.

For a weight w_{ij} connecting neuron i to neuron j, the chain rule gives the gradient of the error E as

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial \text{net}_j}\frac{\partial \text{net}_j}{\partial w_{ij}}     (Eq. 1)

Because \text{net}_j = \sum_{k=1}^{n} w_{kj} o_k is the weighted sum of the inputs to neuron j, only the term w_{ij} o_i depends on w_{ij}:

\frac{\partial \text{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n} w_{kj} o_k\right) = \frac{\partial}{\partial w_{ij}}\, w_{ij} o_i = o_i.     (Eq. 2)

The output o_j is the activation function \varphi applied to \text{net}_j, so

\frac{\partial o_j}{\partial \text{net}_j} = \frac{\partial \varphi(\text{net}_j)}{\partial \text{net}_j}     (Eq. 3)

If j is an output neuron, its output o_j is the network output y and

\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y}     (Eq. 4)

If j is an inner neuron, the error reaches it through every neuron \ell in the set L of neurons that receive input from j:

\frac{\partial E}{\partial o_j} = \sum_{\ell \in L}\left(\frac{\partial E}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L}\left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}\frac{\partial \text{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L}\left(\frac{\partial E}{\partial o_\ell}\frac{\partial o_\ell}{\partial \text{net}_\ell}\, w_{j\ell}\right)     (Eq. 5)

Backpropagation requires the derivatives of the activation functions to be known at network design time. Automatic differentiation is a technique that can automatically and analytically provide those derivatives to the training algorithm. In the context of learning, backpropagation is commonly used by the gradient descent optimization algorithm to adjust the weights of neurons by calculating the gradient of the loss function; backpropagation computes the gradients, whereas (stochastic) gradient descent uses those gradients to train the model (via optimization).

The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct output. The motivation for backpropagation is to train a multi-layered neural network so that it can learn the appropriate internal representations that allow it to learn an arbitrary mapping of input to output. To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuition about the relationship between the actual output of a neuron and the correct output for a particular training example.
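The equations above can be assembled into a complete gradient computation. The following is a minimal sketch (not taken from any particular library) for a network with one hidden layer, assuming a sigmoid activation φ and the squared error E = ½(y − t)²; the function and variable names are illustrative only.

```python
import numpy as np

def phi(x):                      # activation function (assumed sigmoid)
    return 1.0 / (1.0 + np.exp(-x))

def phi_prime(x):                # d phi / d net   (Eq. 3)
    s = phi(x)
    return s * (1.0 - s)

def backprop_single_example(x, t, W1, W2):
    # forward pass: net_j = sum_k w_kj o_k, o_j = phi(net_j)
    net_hidden = W1 @ x
    o_hidden = phi(net_hidden)
    net_out = W2 @ o_hidden
    y = phi(net_out)

    # output layer: dE/do_j = dE/dy = y - t for squared error   (Eq. 4)
    delta_out = (y - t) * phi_prime(net_out)          # dE/dnet_j

    # inner layer: dE/do_j = sum_l delta_l * w_jl   (Eq. 5)
    delta_hidden = (W2.T @ delta_out) * phi_prime(net_hidden)

    # weight gradients: dE/dw_ij = delta_j * o_i   (Eqs. 1 and 2)
    grad_W2 = np.outer(delta_out, o_hidden)
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_W2, y
```

A gradient descent step would then update each weight as w_ij ← w_ij − η ∂E/∂w_ij for some learning rate η.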
Consider a simple neural network with two input units, one output unit and no hidden units, in which each neuron uses a linear output (unlike most work on neural networks, in which the mapping from inputs to outputs is non-linear) that is the weighted sum of its inputs. Initially, before training, the weights are set randomly. The neuron then learns from training examples, which in this case consist of a set of tuples (x_1, x_2, t), where x_1 and x_2 are the inputs to the network and t is the correct output (the output the network should produce given those inputs, once it has been trained). The initial network, given x_1 and x_2, will compute an output y that likely differs from t (since the weights are random). A loss function L(t, y) is used to measure the discrepancy between the expected output t and the actual output y. For regression problems the squared error can be used as a loss function; for classification, the categorical cross-entropy can be used.
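As a rough illustration of this setup, the sketch below trains the two-input, one-output linear neuron with the squared error loss and plain gradient descent. The learning rate, the toy training tuples, and the target rule t = x1 + 2·x2 are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=2)            # weights start out random
eta = 0.1                         # learning rate (assumed)

# training tuples (x1, x2, t); here t = x1 + 2*x2, an arbitrary target rule
data = [(0.5, 1.0, 2.5), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0), (1.0, 1.0, 3.0)]

for epoch in range(200):
    for x1, x2, t in data:
        x = np.array([x1, x2])
        y = w @ x                  # linear output: weighted sum of inputs
        grad = (y - t) * x         # dL/dw for L(t, y) = 0.5 * (t - y)^2
        w -= eta * grad            # gradient descent update

print(w)   # approaches [1, 2], the rule used to generate t
```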

[ "Artificial neural network", "Backpropagation through time", "backpropagation neural nets", "Counterpropagation network", "levenberg marquardt back propagation", "back propagation artificial neural network" ]