
Gradient descent

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is also known as steepest descent. However, gradient descent should not be confused with the method of steepest descent for approximating integrals.

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, that is, $-\nabla F(\mathbf{a})$. It follows that, if

$$\mathbf{a}_{n+1} = \mathbf{a}_n - \gamma \nabla F(\mathbf{a}_n)$$

for $\gamma \in \mathbb{R}_{+}$ small enough, then $F(\mathbf{a}_n) \geq F(\mathbf{a}_{n+1})$. In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, toward the minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$ and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n), \quad n \geq 0.$$

We then have a monotonic sequence $F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots$, so the sequence $(\mathbf{x}_n)$ hopefully converges to the desired local minimum. Note that the value of the step size $\gamma_n$ is allowed to change at every iteration.

With certain assumptions on the function $F$ (for example, $F$ convex and $\nabla F$ Lipschitz) and particular choices of $\gamma_n$ (e.g., chosen either via a line search that satisfies the Wolfe conditions or via the Barzilai-Borwein method, sketched at the end of this article), convergence to a local minimum can be guaranteed. When the function $F$ is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
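The iteration above translates directly into code. The following is a minimal Python sketch (the names gradient_descent, grad_f, gamma, tol, and max_iter are illustrative assumptions, not a standard API); it uses a fixed step size and stops when the gradient is nearly zero.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{n+1} = x_n - gamma * grad_f(x_n) until the gradient vanishes.

    A fixed step size gamma is used for simplicity, although the text notes
    that the step size may change at every iteration.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # gradient nearly zero: (local) minimum reached
            break
        x = x - gamma * g            # step against the gradient
    return x

# Bowl-shaped example: F(x, y) = x^2 + 2*y^2, with gradient (2x, 4y).
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
print(gradient_descent(grad_f, x0=[3.0, -2.0]))  # approaches the minimum at (0, 0)
```

With the step size fixed at 0.1, each component of this example contracts geometrically toward the origin; too large a gamma would instead make the iterates oscillate or diverge.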
This process can be pictured for a function $F$ defined on the plane whose graph has a bowl shape. The contour lines are the curves on which the value of $F$ is constant, and the negative gradient at a point is orthogonal to the contour line going through that point. Following the negative gradient therefore leads to the bottom of the bowl, that is, to the point where the value of the function $F$ is minimal. Gradient descent has problems with pathological functions such as the Rosenbrock function, whose long, curved valley makes progress slow.
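To make this concrete, here is a hedged sketch of plain gradient descent on the Rosenbrock function; the starting point, step size, and iteration count are hand-picked assumptions. The step size must be kept small to avoid divergence in the steep valley, and even after many iterations the iterate is typically still short of the minimum at (1, 1).

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock function F(x, y) = (1 - x)^2 + 100*(y - x^2)^2, minimized at (1, 1)."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
        200.0 * (x[1] - x[0] ** 2),
    ])

x = np.array([-1.0, 1.0])   # start on the far side of the curved valley
gamma = 1e-3                # larger steps diverge; smaller ones crawl even more slowly
for _ in range(10_000):
    x = x - gamma * rosenbrock_grad(x)
print(x, rosenbrock(x))     # compare with the true minimizer (1, 1)
```

The narrow parabolic valley forces a tiny step size, so the iterates inch along the valley floor; this slow progress is exactly the pathology referred to above.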

[ "Artificial neural network", "exponentiated gradient", "sobolev gradient", "Method of steepest descent", "Stochastic gradient descent", "Conjugate residual method" ]
Parent Topic
Child Topic
    No Parent Topic
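The Barzilai-Borwein method mentioned above replaces the fixed step size with one computed from the two most recent iterates and gradients. Below is a minimal sketch of one common variant (the names are illustrative assumptions, and the abs() safeguard is a practical convention rather than part of the original method).

```python
import numpy as np

def gradient_descent_bb(grad_f, x0, gamma0=1e-3, tol=1e-8, max_iter=10_000):
    """Gradient descent with a Barzilai-Borwein step size.

    gamma_n = |dx . dg| / (dg . dg), where dx and dg are the differences of
    the two most recent iterates and gradients; gamma0 seeds the first step.
    """
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    gamma = gamma0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - gamma * g
        g_new = grad_f(x_new)
        dx, dg = x_new - x, g_new - g
        denom = dg @ dg
        if denom > 0:                     # guard against a zero denominator
            gamma = abs(dx @ dg) / denom  # Barzilai-Borwein step size
        x, g = x_new, g_new
    return x

# Same bowl-shaped example as before: F(x, y) = x^2 + 2*y^2.
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
print(gradient_descent_bb(grad_f, x0=[3.0, -2.0]))  # converges to (0, 0)
```

Because the step adapts to the local curvature, this variant typically needs far fewer iterations than a fixed step size on ill-conditioned problems.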