
Stochastic approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations. For instance, many optimization problems in engineering are of this type: no mathematical model of the system is available (the system may be too complex), yet one would still like to optimize its behavior by adjusting certain parameters. For this purpose, one can run experiments or simulations to evaluate the performance of the system at given values of the parameters. Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory. These methods are often used in statistics and machine learning, which typically must handle noisy measurements of empirical data, and they are also related to stochastic optimization methods and algorithms. In a nutshell, stochastic approximation algorithms deal with a function of the form

    f(θ) = E_ξ[F(θ, ξ)],

the expected value of a function depending on a random variable ξ.
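To make the setting concrete, here is a minimal sketch in Python. The particular F and noise distribution below are hypothetical illustrations, not part of the original text: f(θ) = E[F(θ, ξ)] cannot be evaluated exactly, but it can be approximated by averaging random samples of F(θ, ξ).

```python
import random

def F(theta, xi):
    # Hypothetical F: since E[xi] = 0 here, f(theta) = (theta - 3)^2.
    return (theta - 3.0) ** 2 + xi

def estimate_f(theta, n_samples=100_000):
    """Monte Carlo estimate of f(theta) = E[F(theta, xi)] from noisy samples."""
    random.seed(0)  # fixed seed for reproducibility of this sketch
    total = sum(F(theta, random.gauss(0.0, 1.0)) for _ in range(n_samples))
    return total / n_samples
```

A stochastic approximation method would go further: rather than estimating f at fixed points, it uses the samples inside a recursive update to locate a zero or an extremum directly.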
The goal is to recover properties of such a function f without evaluating it directly. Instead, stochastic approximation algorithms use random samples of F(θ, ξ) to efficiently approximate properties of f such as zeros or extrema. The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms, introduced in 1951 and 1952 respectively.

The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root-finding problem in which the function is represented as an expected value. Assume that we have a function M(θ) and a constant α such that the equation M(θ) = α has a unique root at θ*. It is assumed that while we cannot directly observe the function M(θ), we can instead obtain measurements of a random variable N(θ) with E[N(θ)] = M(θ). The structure of the algorithm is then to generate iterates of the form

    θ_{n+1} = θ_n − a_n (N(θ_n) − α).

Here, a_1, a_2, … is a sequence of positive step sizes. Robbins and Monro proved (Theorem 2) that θ_n converges to θ* in L² (and hence also in probability), and Blum later proved that the convergence actually holds with probability one, provided that: (i) N(θ) is uniformly bounded; (ii) M(θ) is nondecreasing; (iii) M′(θ*) exists and is positive; and (iv) the step sizes satisfy Σ_n a_n = ∞ and Σ_n a_n² < ∞. A particular sequence of steps satisfying these conditions, suggested by Robbins and Monro, has the form a_n = a/n for some a > 0. Other sequences are possible, but in order to average out the noise in N(θ), the above conditions must be met.
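The Robbins–Monro recursion above can be sketched in a few lines of Python. The particular M, noise level, and constants below are hypothetical illustrations chosen so that the root is known in advance.

```python
import random

def robbins_monro(noisy_measurement, alpha, theta0, n_iters=20_000, a=1.0):
    """Solve M(theta) = alpha given only noisy measurements N(theta)
    with E[N(theta)] = M(theta), via theta_{n+1} = theta_n - a_n (N(theta_n) - alpha)."""
    theta = theta0
    for n in range(1, n_iters + 1):
        a_n = a / n  # step sizes: sum a_n diverges, sum a_n^2 converges
        theta = theta - a_n * (noisy_measurement(theta) - alpha)
    return theta

random.seed(0)  # fixed seed for reproducibility of this sketch
# Hypothetical example: M(theta) = 2*theta + 1 observed with Gaussian noise;
# the root of M(theta) = 5 is theta* = 2.
noisy = lambda th: 2.0 * th + 1.0 + random.gauss(0.0, 0.5)
est = robbins_monro(noisy, alpha=5.0, theta0=0.0)
```

Note that the iteration never evaluates M itself; each step uses a single noisy measurement, and the decaying steps a_n = a/n average the noise out over time.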
While the Robbins–Monro algorithm is theoretically able to achieve O(1/n) under the assumptions of twice continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is primarily because the algorithm is very sensitive to the choice of the step-size sequence, and the supposedly asymptotically optimal step-size policy can be quite harmful in the early iterations. Chung (1954) and Fabian (1968) showed that the optimal convergence rate O(1/√n) is achieved with a_n = ∇²f(θ*)⁻¹/n (equivalently, a_n = 1/(n M′(θ*))). Lai and Robbins designed adaptive procedures to estimate M′(θ*) such that θ_n has minimal asymptotic variance. However, the application of such optimal methods requires much a priori information which is hard to obtain in most situations. To overcome this shortfall, Polyak (1991) and Ruppert (1988) independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky also presented a method of accelerating Robbins–Monro for linear and non-linear root-finding problems through the use of longer steps and averaging of the iterates. The algorithm has the following structure: run the Robbins–Monro recursion with step sizes that decrease more slowly than 1/n (for example, a_n = a/nᵖ with 1/2 < p < 1),

    θ_{n+1} = θ_n − a_n (N(θ_n) − α),

and output the average of the iterates,

    θ̄_n = (1/n) Σ_{i=1}^{n} θ_i.
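The averaging scheme can be sketched as a small variant of the basic recursion. As before, the test function M, noise level, and constants are hypothetical illustrations; the longer steps a_n = a/n^0.75 and the running mean of the iterates are the two ingredients of the Polyak–Ruppert idea.

```python
import random

def robbins_monro_averaged(noisy_measurement, alpha, theta0,
                           n_iters=20_000, a=1.0, p=0.75):
    """Polyak-Ruppert averaging: run Robbins-Monro with longer steps
    a_n = a / n**p (1/2 < p < 1) and return the average of the iterates."""
    theta = theta0
    avg = 0.0
    for n in range(1, n_iters + 1):
        a_n = a / n ** p                  # decays more slowly than 1/n
        theta = theta - a_n * (noisy_measurement(theta) - alpha)
        avg += (theta - avg) / n          # running mean of theta_1..theta_n
    return avg

random.seed(0)  # fixed seed for reproducibility of this sketch
# Same hypothetical example as before: root of 2*theta + 1 = 5 is theta* = 2.
noisy = lambda th: 2.0 * th + 1.0 + random.gauss(0.0, 0.5)
est = robbins_monro_averaged(noisy, alpha=5.0, theta0=0.0)
```

The longer steps let the raw iterates move quickly without a carefully tuned constant, while the averaged iterate θ̄_n smooths out their fluctuations; this is what removes the need for the a priori knowledge of M′(θ*) that the optimal 1/n step sizes require.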
