
Softmax function

In mathematics, the softmax function, also known as softargmax or the normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and might not sum to 1; after applying softmax, each component lies in the interval (0, 1) and the components add up to 1, so that they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities. Softmax is often used in neural networks to map the non-normalized output of a network to a probability distribution over predicted output classes. John S. Bridle, who introduced the term, motivated it as follows: "We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity. ... For any input, the outputs must all be positive and they must sum to unity."

The standard (unit) softmax function σ : ℝ^K → ℝ^K is defined by the formula

\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \dots, K \text{ and } \mathbf{z} = (z_1, \dots, z_K) \in \mathbb{R}^K.

In words: we apply the standard exponential function to each element z_i of the input vector z and normalize these values by dividing by the sum of all these exponentials; this normalization ensures that the sum of the components of the output vector σ(z) is 1.

Instead of e, a different base b > 0 can be used; choosing a larger value of b creates a probability distribution that is more concentrated around the positions of the largest input values. Writing b = e^β or b = e^{−β} (for real β) yields the expressions

\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}}.

In some fields the base is fixed, corresponding to a fixed scale, while in others the parameter β is varied.
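The following NumPy sketch illustrates the definition and the role of β; the function name, the beta keyword, and the max-subtraction step are illustrative choices for this example, not part of the definition above.

import numpy as np

def softmax(z, beta=1.0):
    """Softmax of a 1-D array z with inverse-temperature beta.

    beta = 1 gives the standard (unit) softmax; larger beta concentrates
    the distribution around the largest inputs, smaller beta flattens it.
    """
    z = np.asarray(z, dtype=float)
    # Subtracting the maximum leaves the result unchanged (the common
    # factor cancels in numerator and denominator) but avoids overflow.
    exps = np.exp(beta * (z - np.max(z)))
    return exps / exps.sum()

print(softmax([1.0, 5.0, 10.0]))            # components in (0, 1), summing to 1
print(softmax([1.0, 5.0, 10.0], beta=3.0))  # more concentrated on the largest input
print(softmax([1.0, 5.0, 10.0], beta=0.1))  # closer to uniform

The max subtraction is a common numerical-stability convention: it changes nothing mathematically but keeps the exponentials from overflowing for large inputs.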
The name "softmax" is misleading: the function is not a smooth maximum (a smooth approximation to the maximum function), but rather a smooth approximation to the arg max function, i.e. the function whose value is the index of the maximum. In fact, the term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", but the term "softmax" is conventional in machine learning. In this section, the term "softargmax" is used to emphasize this interpretation.

Formally, instead of considering the arg max as a function with categorical output 1, …, n (corresponding to the index), consider the arg max function with a one-hot representation of the output (assuming there is a unique maximal argument):

\operatorname{arg\,max}(z_1, \dots, z_n) = (y_1, \dots, y_n) = (0, \dots, 0, 1, 0, \dots, 0),

where the output coordinate y_i = 1 if and only if i is the arg max of (z_1, …, z_n), meaning z_i is the unique maximum value of (z_1, …, z_n). For example, in this encoding arg max(1, 5, 10) = (0, 0, 1), since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal z_i being the maximum) by dividing the 1 between all maximal arguments; formally each receives 1/k, where k is the number of arguments attaining the maximum. For example, arg max(1, 5, 5) = (0, 1/2, 1/2), since the second and third arguments are both the maximum. In case all arguments are equal, this is simply arg max(z, …, z) = (1/n, …, 1/n). Points z with multiple arg max values are singular points (or singularities, and form the singular set); these are the points where arg max is discontinuous (with a jump discontinuity), while points with a single arg max are known as non-singular or regular points.
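As a complement to the description above, here is a small NumPy sketch (the function names are illustrative) comparing the one-hot arg max, with ties split as 1/k, against softargmax as β grows.

import numpy as np

def onehot_argmax(z):
    """One-hot arg max; tied maxima share the 1 equally (1/k each)."""
    z = np.asarray(z, dtype=float)
    is_max = (z == z.max())
    return is_max / np.count_nonzero(is_max)

def softargmax(z, beta=1.0):
    """Softmax with inverse-temperature beta (max subtracted for stability)."""
    z = np.asarray(z, dtype=float)
    exps = np.exp(beta * (z - z.max()))
    return exps / exps.sum()

print(onehot_argmax([1, 5, 10]))        # [0.  0.  1. ]  -- regular point
print(onehot_argmax([1, 5, 5]))         # [0.  0.5 0.5]  -- singular point
print(softargmax([1, 5, 10], beta=20))  # approximately (0, 0, 1)
print(softargmax([1, 5, 5], beta=20))   # approximately (0, 1/2, 1/2)

At regular points the softargmax output approaches the one-hot encoding as β increases; at singular points it approaches the tie-splitting values, reflecting the jump discontinuity of arg max there.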

[ "Convolutional neural network", "Deep learning", "Artificial neural network", "Classifier (linguistics)" ]
Parent Topic
Child Topic
    No Parent Topic