
Minimum description length

The minimum description length (MDL) principle is a formalization of Occam's razor in which the best hypothesis (a model and its parameters) for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978. It is an important concept in information theory and computational learning theory.

MDL is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally (Grünwald, 1998). Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. MDL is a theory of inductive and statistical inference that starts out from this idea: all statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one that is able to compress the data most.

Like other statistical methods, MDL can be used for learning the parameters of a model from data. Usually, though, standard statistical methods assume that the general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general form of a model and its parameters. The quantity of interest (sometimes just a model, sometimes just parameters, sometimes both at the same time) is called a hypothesis. The basic idea is then to consider the (lossless) two-stage code that encodes the data D with length L(D) by first encoding a hypothesis H in the set of considered hypotheses 𝓗 and then coding D "with the help of" H; in the simplest context this just means "encoding the deviations of the data from the predictions made by H":

    L(D) = min_{H ∈ 𝓗} ( L(H) + L(D | H) )

The H achieving this minimum is then viewed as the best explanation of the data D.

As a simple example, take a regression problem: the data D could consist of a sequence of points D = (x₁, y₁), …, (xₙ, yₙ), and the set 𝓗 could be the set of all polynomials from X to Y. To describe a polynomial H of degree (say) k, one would first have to discretize the parameters to some precision; one would then have to describe this precision (a natural number); next, one would have to describe the degree k (another natural number); and in the final step, one would have to describe the k + 1 parameters. The total length of this description is L(H). One would then describe the points in D using some fixed code for the x-values and then using a code for the n deviations yᵢ − H(xᵢ).
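To make the two-part scheme concrete, here is a minimal sketch (Python is used purely for illustration; the (k + 1)/2 · log₂ n parameter cost and the Gaussian code for the deviations are assumptions of the sketch, not prescriptions of the text above). It fits polynomials of increasing degree and picks the degree with the smallest total description length L(H) + L(D | H); the fixed code for the x-values is the same for every hypothesis, so it drops out of the comparison.

    import numpy as np

    def two_part_mdl_degree(x, y, max_degree=8):
        """Select a polynomial degree by minimizing an approximate two-part codelength.

        L(H)     ~ (k + 1) / 2 * log2(n) bits -- k + 1 coefficients quantized at the
                   usual asymptotically optimal precision (an assumption of this sketch).
        L(D | H) = Gaussian codelength of the deviations y_i - H(x_i), in bits,
                   at the maximum-likelihood noise variance.
        """
        n = len(x)
        best = None
        for k in range(max_degree + 1):
            coeffs = np.polyfit(x, y, k)                   # fitted hypothesis H of degree k
            deviations = y - np.polyval(coeffs, x)         # y_i - H(x_i)
            sigma2 = max(np.mean(deviations ** 2), 1e-12)  # ML estimate of the noise variance
            l_data = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)  # L(D | H)
            l_hyp = 0.5 * (k + 1) * np.log2(n)                     # L(H)
            total = l_hyp + l_data
            if best is None or total < best[1]:
                best = (k, total)
        return best  # (selected degree, total description length in bits)

    # Synthetic data from a noisy cubic
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 200)
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=x.shape)
    print(two_part_mdl_degree(x, y))

With these synthetic data, the minimum is typically attained at degree 3: higher degrees shorten the deviation code only marginally while paying about 0.5 · log₂ n extra bits per additional coefficient.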
In practice, one often (but not always) uses a probabilistic model. For example, one associates each polynomial H with the corresponding conditional distribution expressing that, given X, Y is normally distributed with mean H(X) and some variance σ², which could either be fixed or added as a free parameter. Then the set of hypotheses 𝓗 reduces to the assumption of a linear model, Y = H(X) + ε, with H a polynomial.

Furthermore, one is often not directly interested in specific parameter values, but just, for example, in the degree of the polynomial. In that case, one sets 𝓗 to be 𝓗 = {𝓗₀, 𝓗₁, …}, where each 𝓗ⱼ represents the hypothesis that the data are best described as a j-th degree polynomial. One then codes the data D given hypothesis 𝓗ⱼ using a one-part code designed such that, whenever some hypothesis H ∈ 𝓗ⱼ fits the data well, the codelength L(D | H) is short. The design of such codes is called universal coding. There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones. The "best" codes (in the sense that they have a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes. A quite useful class of codes are the Bayesian marginal likelihood codes. For exponential families of distributions, when Jeffreys prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory into close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys prior, albeit for different reasons.

To select the hypothesis that captures the most regularity in the data, scientists look for the hypothesis with which the best compression can be achieved. In order to do this, a code is fixed to compress the data. Arguably, the most general code one could use is a (Turing-complete) computer language. A program to output the data is written in that language; thus the program effectively represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference, which has been a source of inspiration for MDL.
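As a concrete instance of a universal one-part code, the sketch below (again Python, and again an illustration rather than anything prescribed above) computes the normalized maximum likelihood (Shtarkov) codelength of a binary sequence under the Bernoulli model class: the codelength of the maximized likelihood of the observed sequence, plus the logarithm of the Shtarkov sum of maximized likelihoods over all sequences of the same length (the parametric complexity of the model class).

    from math import comb, log2

    def bernoulli_nml_codelength(bits):
        """NML (Shtarkov) codelength, in bits, of a 0/1 sequence under the Bernoulli model.

        L_NML(D) = -log2 P(D | theta_hat(D)) + log2 C_n, where
        C_n = sum_{j=0}^{n} binom(n, j) * (j/n)^j * ((n - j)/n)^(n - j)
        is the parametric complexity for sequences of length n.
        """
        n = len(bits)
        k = sum(bits)

        def max_log2_lik(j):
            # log2 of the maximized Bernoulli likelihood of any sequence with j ones
            if j in (0, n):
                return 0.0
            p = j / n
            return j * log2(p) + (n - j) * log2(1 - p)

        parametric_complexity = sum(comb(n, j) * 2.0 ** max_log2_lik(j) for j in range(n + 1))
        return -max_log2_lik(k) + log2(parametric_complexity)

    # A fairly regular sequence is compressed below the 16 bits of a literal encoding.
    seq = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
    print(bernoulli_nml_codelength(seq))

The second term is the same for every sequence of length n; it is the worst-case regret paid for not knowing the parameter in advance, and comparing total NML codelengths across model classes is how such codes are used for model selection.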

[ "Algorithm", "Statistics", "Machine learning", "Artificial intelligence", "Pattern recognition", "normalized maximum likelihood", "minimum description length criterion" ]
Parent Topic
Child Topic
    No Parent Topic