non-computability of overfitting

José Manuel Rodríguez Caballero josephcmac at gmail.com
Fri Jan 21 11:07:10 EST 2022


Dear FOM members,
In the field of (supervised) machine learning, we consider two finite
sequences x_train = (x_1, ..., x_n) (features) and y_train = (y_1, ...,
y_n) (labels). We are interested in inferring the rule that produces
y_i from x_i. One possible way [1] to formalize the notion of learning
in the framework of Kolmogorov complexity [2] is to identify "learning"
with finding the smallest algorithm D such that D(x_i) = y_i for all
i = 1, 2, ..., n. This algorithm is called the decompressor of x_train
and y_train. Then, given another sequence x_test = (x_{n+1}, x_{n+2},
..., x_{n+k}), the predictions of this learning process are the
sequence y_test = ( D(x_{n+1}), D(x_{n+2}), ..., D(x_{n+k}) ). The
practical problem is that the function associating x_train and y_train
to D is non-computable.
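
To make the definition concrete, here is a minimal Python sketch of
the decompressor search. Since the true search over all programs of a
universal machine is non-computable, the sketch restricts it to a toy,
bounded instruction set; the instruction set, the length bound, and
all names are illustrative assumptions of mine, not part of any
standard formulation.

    # Purely illustrative: the real decompressor search is
    # non-computable, so we bound it to a toy instruction set.
    from itertools import product

    INSTRUCTIONS = [lambda v: v + 1, lambda v: v - 1, lambda v: v * 2]

    def run(prog, x):
        # A "program" here is a finite sequence of unary operations.
        for op in prog:
            x = op(x)
        return x

    def decompressor(x_train, y_train, max_len=5):
        # Enumerate programs shortest first (i.e., by description
        # length) and return the first one that reproduces every
        # training label, mimicking the smallest algorithm D.
        for length in range(1, max_len + 1):
            for prog in product(INSTRUCTIONS, repeat=length):
                if all(run(prog, x) == y
                       for x, y in zip(x_train, y_train)):
                    return prog
        return None  # no program within the length bound

    # Usage: the rule y = 2x + 1 is recovered as the two-step
    # program "double, then increment" and applied to unseen inputs.
    D = decompressor([1, 2, 3], [3, 5, 7])
    print([run(D, x) for x in [4, 5]])  # -> [9, 11]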

What machine learning engineers usually do is consider a space [3, 4]
containing algorithms E such that E(x_i) approximates y_i for all
i = 1, 2, ..., n, and move smoothly from one such algorithm (a point
of the space) to a neighboring algorithm (another point) in order to
improve the approximation. The improvements are produced by the
iterations of a method such as gradient descent. It has been
empirically observed that, in practical situations, on the test
dataset x_test = (x_{n+1}, x_{n+2}, ..., x_{n+k}), the predictions
E( x_{n+i} ) increasingly approximate the values they are intended to
predict as the number of iterations grows, but when the number of
iterations is too large, the approximation degenerates and becomes
worse than the result obtained with fewer iterations. This phenomenon
is known as overfitting.

A possible way to formalize overfitting mathematically, although not
the only one, is to suppose that E( x_{n+i} ) is intended to predict
D( x_{n+i} ) for all i = 1, 2, ..., k. In this framework, can it be
proved that the maximum number of iterations just before overfitting
sets in is a non-computable function of the parameters of the problem?
I do not exclude that this number could be computable in some
particular cases.
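
For definiteness, here is one way to state the question in LaTeX. The
notation is mine, not standard: E_t denotes the algorithm after t
iterations, and the test error is measured against the decompressor's
predictions.

    % Notation introduced here for illustration only.
    \[
      \mathrm{err}(t) = \frac{1}{k} \sum_{i=1}^{k}
        \bigl( E_t(x_{n+i}) - D(x_{n+i}) \bigr)^2 ,
      \qquad
      T = \min \{\, t \in \mathbb{N} :
            \mathrm{err}(t) \le \mathrm{err}(s)
            \text{ for all } s \in \mathbb{N} \,\}.
    \]

The question is then whether the map (x_train, y_train, x_test) |-> T
is computable, assuming err attains its minimum at some finite t.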

Kind regards,
Jose M.

References:
[1] Verbeek, Jakob J. "Overfitting using the Minimum Description
Length Principle." (2000).
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9142&rep=rep1&type=pdf

[2] Pearlmutter, Barak A., and Ronald Rosenfeld. "Chaitin-Kolmogorov
complexity and generalization in neural networks." Advances in Neural
Information Processing Systems (1991): 925-931.

[3] Dong, Xiao, Jiasong Wu, and Ling Zhou. "How deep learning works --
The geometry of deep learning." arXiv preprint arXiv:1710.10784
(2017).

[4] Neyshabur, Behnam, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan
Srebro. "Geometry of optimization and implicit regularization in deep
learning." arXiv preprint arXiv:1705.03071 (2017).

