The principle has its origin in the philosophy of science. In judging between scientific theories, other things being equal, one prefers the simpler explanation. A powerful theory, such as the atomic theory or the theory of gravitation, is one that is compact but accounts for a large number of phenomena.

Many different forms of learning can be characterized in terms of the MDL principle:

** General rules:** The assertion of a general rule eliminates the need
to specify each instance. For instance, given a table of animals and their
features, the rule "All crows are black" means that the "color" field
can be omitted for each crow in the table.

** General rules with exceptions:** The assertion of a rule
with a small number of exceptions saves space in all cases where the rule
applies; the exceptions can be enumerated separately.
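The saving can be made concrete with a toy encoding (a hypothetical sketch; the table, counts, and cost measure are illustrative only):

```python
# Hypothetical table: the color field for 100 crows, two of them albino.
crows = ["black"] * 98 + ["white"] * 2

# Naive encoding: store the color field for every crow.
naive_cost = len(crows)  # 100 entries

# Rule-based encoding: assert "all crows are black" once,
# then enumerate only the exceptions.
exceptions = [(i, c) for i, c in enumerate(crows) if c != "black"]
rule_cost = 1 + len(exceptions)  # 1 rule + 2 exception entries

# Decoding: apply the rule everywhere, then patch in the exceptions.
decoded = ["black"] * len(crows)
for i, c in exceptions:
    decoded[i] = c

assert decoded == crows and rule_cost < naive_cost
```

With no exceptions at all, this reduces to the plain general-rule case.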

** Numerical rules:** If one numeric parameter is a function of another,
then the tabulation of the values of the two parameters can be replaced
by a tabulation of the values of the independent parameter together with
the function.
For example, data consisting of the pairs "X=1.0, Y=3.0; X=1.2, Y=3.2;
X=2.7, Y=4.7; X=5.9, Y=7.9" can be replaced by the rule "Y = X+2.0" and
the values "X=1.0; X=1.2; X=2.7; X=5.9". Of course, one has to be sure
that the rule is shorter than the data. Any K points can be interpolated
by a polynomial of degree K-1, but since that polynomial requires K
coefficients, there is no gain in compactness.
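In code, the replacement amounts to storing the rule plus only the independent values (a minimal sketch using the example data above):

```python
# The example data: four (X, Y) pairs obeying Y = X + 2.0.
pairs = [(1.0, 3.0), (1.2, 3.2), (2.7, 4.7), (5.9, 7.9)]

rule = lambda x: x + 2.0        # the discovered rule Y = X + 2.0
xs = [x for x, _ in pairs]      # only the independent values are stored

# Reconstruction: reapply the rule to each stored X value,
# rounding to the data's 0.1 precision to avoid float noise.
reconstructed = [(x, round(rule(x), 1)) for x in xs]
assert reconstructed == pairs
```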

** Approximate numerical rules:** If one numeric parameter is closely
approximated by a function of another, then the tabulation of the two
parameters can be replaced by giving the function, the values of the
independent parameter, and the error at each point. Since the error is
smaller than the value of the function, it can be represented more
compactly, to any specified degree of accuracy.
For example, data consisting of the pairs "X=1.0, Y=3.1; X=1.2, Y=3.2;
X=2.7, Y=4.8; X=5.9, Y=7.8" can be replaced by the rule "Y = X+2.0+E" and
the values "X=1.0, E=0.1; X=1.2, E=0.0; X=2.7, E=0.1; X=5.9, E=-0.1".
In the initial data, we have to represent values of Y between 1 and 8 to an
accuracy of 0.1, which would require 7 bits each. By contrast, E takes
on only 3 distinct values, and therefore requires only two bits per data point.
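The arithmetic can be checked with a short sketch (using the example data; errors are rounded to the data's 0.1 precision):

```python
import math

# The example data: four (X, Y) pairs approximately obeying Y = X + 2.0.
pairs = [(1.0, 3.1), (1.2, 3.2), (2.7, 4.8), (5.9, 7.8)]
rule = lambda x: x + 2.0

# Residual error E = Y - (X + 2.0), rounded to the 0.1 grid.
errors = [round(y - rule(x), 1) for x, y in pairs]

# E takes on only 3 distinct values, so ceil(log2(3)) = 2 bits suffice.
bits_per_error = math.ceil(math.log2(len(set(errors))))
assert bits_per_error == 2

# Reconstruction: Y = X + 2.0 + E, rounded back to the data's precision.
reconstructed = [(x, round(rule(x) + e, 1)) for (x, _), e in zip(pairs, errors)]
assert reconstructed == pairs
```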

** Neural networks:** A neural network, or any other classification
program, can be viewed in MDL terms as follows: given a table of labelled
instances, delete the desired outputs and replace them with a description
of the network (together with a list of the instances that the network
misclassifies, if it is not perfectly accurate). The original table can
then be reconstructed by applying the network to each instance.
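A minimal stand-in for this idea (a hypothetical threshold classifier in place of a trained network; the table, labels, and names are invented for illustration):

```python
# Hypothetical labelled table: one numeric feature and a class label.
table = [(0.2, "small"), (0.4, "small"), (1.7, "big"), (2.5, "big")]

def classifier(x):
    # "Description of the network": here a single threshold parameter
    # stands in for the network's weights.
    return "big" if x > 1.0 else "small"

# MDL view: store the inputs plus the classifier, not the labels.
inputs = [x for x, _ in table]
reconstructed = [(x, classifier(x)) for x in inputs]
assert reconstructed == table
```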

** Learning from positive examples:** In practice, learning is often
done just from a corpus of positive examples, with no negative examples.
For instance, in language learning, a child is presented almost exclusively
with correct usages of the language; it does not have to rely on
incorrect usages being pointed out. In what sense, then, is a grammar
of the language better than the simple grammar S -> word*, which also
accounts for all the positive data?

MDL can provide an answer. If the language does obey some non-trivial grammar, then that grammar can be used to derive a more compact representation of a corpus of sentences. For instance, consider a context-free grammar where all the rules have been written in the form NON-TERMINAL -> SYMBOL SYMBOL ... We can develop an encoding as follows: for each non-terminal, number the productions from that non-terminal. Then, doing a depth-first search of the parse tree, give the number of the production used at each node. The result is generally an encoding that is a little shorter than the original sentence, since the number of choices at each step is smaller than the number of words in the language.

For instance, suppose the grammar is as follows:

1) S -> NP VP

1) NP -> pronoun

2) NP -> det ADJ_LIST noun PP_LIST

1) VP -> verb PP_LIST

2) VP -> verb NP

1) ADJ_LIST -> null

2) ADJ_LIST -> adj ADJ_LIST

1) PP_LIST -> null

2) PP_LIST -> PP PP_LIST

1) PP -> prep NP

pronoun -> (1) I (2) you ...

noun -> (1) boys (2) fish ...

verb -> (1) fish (2) ate ...

det -> (1) a (2) the ...

adj -> (1) cold (2) sharp ...

prep -> (1) with (2) of ...

Now, the parse tree for the sentence "the boys ate a cold fish":

S(1) ---> NP(2) ---> det(2) ---> the
      |        |---> ADJ_LIST(1) ---> null
      |        |---> noun(1) ---> boys
      |        |---> PP_LIST(1) ---> null
      |
      |-> VP(2) ---> verb(2) ---> ate
               |---> NP(2) ---> det(1) ---> a
                         |---> ADJ_LIST(2) ---> adj(1) ---> cold
                         |              |---> ADJ_LIST(1) ---> null
                         |---> noun(2) ---> fish
                         |---> PP_LIST(1) ---> null

can be represented as the sequence 1,2,2,1,1,1,2,2,2,1,2,1,1,2,1.

** Clustering:** Suppose that instances X1 ... Xn
are divided into clusters C1 ... Ck. Let M(Ci) be the
center of Ci (or a typical instance of Ci).
Then the data can be represented as follows:

- Record M(C1) ... M(Ck).
- For each instance Xi, record the difference between Xi and M(cluster(Xi)).
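A minimal sketch of this encoding (hypothetical one-dimensional data and centers; with real data the centers might come from, say, k-means):

```python
# Hypothetical cluster centers M(C1), M(C2) and one-dimensional instances.
centers = {"C1": 10.0, "C2": 50.0}
data = [9.0, 11.0, 10.5, 49.0, 51.5]

# Encoding: for each instance, record its cluster and the small residual.
encoded = []
for x in data:
    c = min(centers, key=lambda name: abs(x - centers[name]))
    encoded.append((c, x - centers[c]))

# The residuals are small relative to the raw values,
# so they can be represented more compactly.
assert all(abs(d) <= 1.5 for _, d in encoded)

# Decoding: center plus residual recovers each instance exactly.
decoded = [centers[c] + d for c, d in encoded]
assert decoded == data
```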