The principle has its origin in the philosophy of science. In judging between scientific theories, other things being equal, one prefers the simpler explanation. A powerful theory, such as the atomic theory or the theory of gravitation, is one that is compact but accounts for a large number of phenomena.
Many different forms of learning can be characterized in terms of the MDL principle:
General rules: The assertion of a general rule eliminates the need to specify each instance. For instance, given a table of animals and their features, the rule ``All crows are black'' means that the ``color'' field can be omitted for each crow in the table.
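As a toy sketch (the animal records and the encode function here are invented purely for illustration), an encoder that knows the rule can simply drop the color field for every crow:

# Hypothetical records of the form (species, color). The rule
# "All crows are black" lets the encoder omit the color of every crow.
animals = [("crow", "black"), ("crow", "black"), ("swan", "white")]

def encode(records, use_rule):
    encoded = []
    for species, color in records:
        if use_rule and species == "crow":
            encoded.append((species,))           # color implied by the rule
        else:
            encoded.append((species, color))
    return encoded

print(encode(animals, use_rule=False))   # every record carries its color
print(encode(animals, use_rule=True))    # crows omit it; the rule is stored once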
General rules with exceptions: The assertion of a rule with a small number of exceptions saves space in all cases where the rule applies; the exceptions can be enumerated separately.
Numerical rules: If one numeric parameter is a function of another, then the tabulation of the values of the two parameters can be replaced by a tabulation of the values of the independent parameter and the function. For example, data consisting of the pairs ``X=1.0, Y=3.0; X=1.2, Y=3.2; X=2.7, Y=4.7; X=5.9, Y=7.9'' can be replaced by the rule ``Y = X+2.0'' and the values ``X=1.0; X=1.2; X=2.7; X=5.9''. Of course, one has to be sure that the rule plus the reduced table is shorter than the original data. Any K points can be interpolated by a polynomial of degree K-1, but since the polynomial requires K coefficients, there is no gain in compactness.
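A quick sketch of the replacement, using the numbers above (the explicit check that the rule reproduces the data exactly is made part of the sketch):

# Original data: four (X, Y) pairs.
pairs = [(1.0, 3.0), (1.2, 3.2), (2.7, 4.7), (5.9, 7.9)]

rule = lambda x: x + 2.0             # the proposed rule Y = X+2.0

# The rule may replace the Y column only if it reproduces every Y exactly.
assert all(abs(rule(x) - y) < 1e-9 for x, y in pairs)

xs = [x for x, _ in pairs]           # compressed form: the rule plus the X values
print("numbers in the original table:", 2 * len(pairs))    # 8
print("numbers in the compressed form:", len(xs))           # 4, plus the short rule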
Approximate numerical rules: If one numeric parameter is closely approximated by a function of another, then the tabulation of the two parameters can be replaced by giving the function, the values of the independent parameter, and the value of the error at each point. Since the error is much smaller in magnitude than the value of the function, it can be represented more compactly, to any specified degree of accuracy. For example, data consisting of the pairs ``X=1.0, Y=3.1; X=1.2, Y=3.2; X=2.7, Y=4.8; X=5.9, Y=7.8'' can be replaced by the rule ``Y = X+2.0+E'' and the values ``X=1.0, E=0.1; X=1.2, E=0.0; X=2.7, E=0.1; X=5.9, E=-0.1''. In the original data, we have to represent values of Y between 1 and 8 to an accuracy of 0.1, which would require 7 bits each. By contrast, E takes on only 3 values, and therefore requires only two bits per data point.
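The bit counts can be checked directly (a small worked computation, assuming fixed-length codes for both Y and E):

import math

# Y ranges over [1, 8] at a resolution of 0.1: about 70 distinct values.
num_Y_values = 70
print(math.ceil(math.log2(num_Y_values)))        # 7 bits per data point

# The residual E only ever takes the three values 0.1, 0.0, -0.1.
errors = [0.1, 0.0, 0.1, -0.1]
print(math.ceil(math.log2(len(set(errors)))))    # 2 bits per data point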
Arbitrary exact classifier: Any classification system can be viewed in terms of MDL as follows: given a table of labelled instances, delete the column of desired outputs and replace it with a description of the classifier. The original table can then be reconstructed by applying the classifier to each instance.
Arbitrary inexact classifier: Suppose you have a classification system that predicts the classification attribute better than randomly guessing it according to its distribution. Let C be the classification attribute and R the prediction. Then the conditional entropy of C given R is less than the entropy of C: Ent(C|R) < Ent(C). Therefore, using R, you can find an encoding of C that is shorter than the optimal encoding that does not use R. So if you encode the C column as the description of the classifier plus the optimal encoding of C given R, the result is shorter than the optimal encoding of C without R.
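Here is a small sketch of the entropy comparison, using an invented joint sample of the true class C and the prediction R (the 80% agreement rate is made up for illustration):

import math
from collections import Counter

# Hypothetical paired observations (C, R): the prediction R agrees with the
# true class C 80% of the time.
data = [("a", "a")] * 40 + [("b", "b")] * 40 + [("a", "b")] * 10 + [("b", "a")] * 10

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    # Ent(C|R) = sum over r of P(R=r) * Ent(C | R=r)
    n = len(pairs)
    by_r = {}
    for c, r in pairs:
        by_r.setdefault(r, []).append(c)
    return sum(len(cs) / n * entropy(cs) for cs in by_r.values())

print(entropy([c for c, _ in data]))     # Ent(C)   = 1.00 bit per instance
print(conditional_entropy(data))          # Ent(C|R) is about 0.72 bits per instance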
Clustering: Suppose that instances X1 ... Xn are divided into clusters C1 ... Ck. Let M(Ci) be the center of Ci (or a typical instance of Ci). Then the data can be represented as follows: record the centers M(C1) ... M(Ck); then, for each instance Xi, record the index of its cluster together with the difference between Xi and the center of that cluster. If the clusters are tight, these differences are small and can be encoded compactly, just like the error term in an approximate numerical rule.
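A minimal sketch of that representation for one-dimensional data (the points and centers are invented for illustration):

# Hypothetical 1-D instances falling into two tight clusters.
points  = [10.1, 9.9, 10.0, 49.8, 50.2, 50.0]
centers = [10.0, 50.0]                          # M(C1), M(C2)

# Each instance is encoded as (index of its cluster, offset from that center).
encoded = []
for x in points:
    i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
    encoded.append((i, round(x - centers[i], 1)))

print(encoded)
# [(0, 0.1), (0, -0.1), (0, 0.0), (1, -0.2), (1, 0.2), (1, 0.0)]
# The offsets are all small, so they can be written down in very few bits.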
Learning from positive examples: In practice, learning is often done just from a corpus of positive examples, with no negative examples. For instance, in language learning, a child is presented almost exclusively with correct usages of the language; it does not have to rely on incorrect usages being pointed out. In what sense, then, is a grammar of the language better than the simple grammar S -> word*, which also accounts for all the positive data?
MDL can provide an answer. If the language does obey some non-trivial grammar, then that grammar can be used to derive a more compact representation of a corpus of sentences. For instance, consider a context-free grammar in which every rule has the form NON-TERMINAL -> SYMBOL SYMBOL ... We can develop an encoding as follows: for each non-terminal, number the productions from that non-terminal. Then, doing a depth-first search of the parse tree, give the number of the production used at each node. The result is generally an encoding that is a little shorter than the original sentence, since the number of choices at each step is less than the number of words in the language.
For instance, suppose the grammar is as follows:
1) S -> NP VP
1) NP -> pronoun
2) NP -> det ADJ_LIST noun PP_LIST
1) VP -> verb PP_LIST
2) VP -> verb NP
1) ADJ_LIST -> null
2) ADJ_LIST -> adj ADJ_LIST
1) PP_LIST -> null
2) PP_LIST -> PP PP_LIST
1) PP -> prep NP
pronoun -> (1) I (2) you ...
noun -> (1) boys (2) fish ...
verb -> (1) fish (2) ate ...
det -> (1) a (2) the ...
adj -> (1) cold (2) sharp ...
prep -> (1) with (2) of ...
Now, the parse tree
S(1) ---> NP(2) ---> det(2) ---> the
      |         |-> ADJ_LIST(1) -> null
      |         |-> noun(1) ---> boys
      |         |-> PP_LIST(1) -> null
      |-> VP(2) ---> verb(2) ---> ate
                |-> NP(2) ---> det(1) ---> a
                          |-> ADJ_LIST(2) ---> adj(1) ---> cold
                          |               |-> ADJ_LIST(1) ---> null
                          |-> noun(2) -> fish
                          |-> PP_LIST(1) -> null

can be represented as the sequence 1,2,2,1,1,1,2,2,2,1,2,1,1,2,1.
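A sketch of this encoding, with the parse tree above written out as nested (production number, children) pairs (this data structure is an arbitrary choice made for the sake of the sketch):

# Each node is (production_number, [children]); the words themselves need not
# be stored, since they are determined by the production numbers.
def node(num, *children):
    return (num, list(children))

tree = node(1,                                        # S -> NP VP
            node(2,                                   # NP -> det ADJ_LIST noun PP_LIST
                 node(2), node(1), node(1), node(1)), #   the, null, boys, null
            node(2,                                   # VP -> verb NP
                 node(2),                             #   ate
                 node(2,                              #   NP -> det ADJ_LIST noun PP_LIST
                      node(1),                        #     a
                      node(2, node(1), node(1)),      #     ADJ_LIST -> adj ADJ_LIST; cold, null
                      node(2), node(1))))             #     fish, null

def encode(t):
    num, children = t
    seq = [num]
    for child in children:            # depth-first, left to right
        seq.extend(encode(child))
    return seq

print(encode(tree))   # [1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1]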
Explanation of overfitting: The MDL theory gives an elegant explanation of why overly rich representation schemes tend to overfit: when the encoding of the classifier itself is longer than the original data, or almost as long, then nothing is gained in terms of description length. For example, you can represent K numbers as the values of a polynomial of degree K-1, but no description length is gained, since you now need K values for the coefficients of the polynomial. You can fit a decision tree to data exactly if there is a separate leaf for each datum, but again there is no gain. You can cluster N points tightly into N clusters, one per point, but again nothing is gained.
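A quick numerical illustration of the polynomial case (a sketch using numpy, with invented data):

import numpy as np

# Four arbitrary data points.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.3, 0.7, 5.1, 4.4])

# A degree-3 polynomial passes through all four points exactly...
coeffs = np.polyfit(xs, ys, deg=3)
print(np.allclose(np.polyval(coeffs, xs), ys))     # True

# ...but it costs exactly as many coefficients as there were Y values,
# so nothing is saved in description length.
print(len(coeffs), "coefficients for", len(ys), "data values")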