non-computability of overfitting
JOSEPH SHIPMAN
joeshipman at aol.com
Fri Jan 21 20:53:31 EST 2022
This is an excellent conjecture, because, if true, it would be the best example I know of a natural uncomputable function with easily computable upper and lower bounds that are not very far from each other.
— JS
Sent from my iPhone
> On Jan 21, 2022, at 7:51 PM, José Manuel Rodríguez Caballero <josephcmac at gmail.com> wrote:
>
>
> Dear FOM members,
> In the field of (supervised) machine learning, we consider two finite sequences x_train = (x_1, ..., x_n) (features) and y_train = (y_1, ..., y_n) (labels). We are interested in inferring the rule that obtains y_i from x_i. One possible way [1] to formalize the notion of learning in the framework of Kolmogorov complexity [2] is to identify "learning" with the smallest algorithm D such that D(x_i) = y_i for all i = 1, 2, ..., n. This algorithm is called the decompressor of x_train and y_train. Then, given another sequence x_test = (x_{n+1}, x_{n+2}, ..., x_{n+k}), the predictions of this learning process are the sequence y_test = ( D(x_{n+1}), D(x_{n+2}), ..., D(x_{n+k}) ). The practical problem is that the function associating x_train and y_train to D is non-computable.
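[Editor's note: a toy, hypothetical illustration of the decompressor idea. The full Kolmogorov-style search over all programs is uncomputable; restricting the program space to a small, made-up grammar of arithmetic expressions makes the search finite and computable, at the cost of only approximating the true shortest description. The grammar and the data below are invented for illustration.]

```python
# Toy stand-in for the decompressor search: enumerate short arithmetic
# expressions in one variable x, in order of description length, and
# return the first one that reproduces every training label exactly.
def shortest_decompressor(x_train, y_train):
    atoms = ["x", "1", "2", "3"]
    ops = ["+", "-", "*"]
    candidates = list(atoms)
    frontier = list(atoms)
    for _ in range(3):  # build expressions with up to four atoms
        frontier = [f"{e}{op}{a}" for e in frontier for op in ops for a in atoms]
        candidates.extend(frontier)
    candidates.sort(key=len)  # shortest description first (stable sort)
    for expr in candidates:
        if all(eval(expr, {"x": xi}) == yi for xi, yi in zip(x_train, y_train)):
            return expr
    return None

# Labels follow the hidden rule y = 2x + 1; the search recovers "x+x+1".
print(shortest_decompressor([1, 2, 3, 4], [3, 5, 7, 9]))
```

Note that even in this finite toy, the returned expression depends only on the training pair (x_train, y_train); the test features are fed to it afterward to produce predictions.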
>
> What machine learning engineers usually do is consider a space [3, 4] containing algorithms E such that E(x_i) approximates y_i for all i = 1, 2, ..., n, and move smoothly from one such algorithm (a point) to a neighboring algorithm (point) to improve the approximation. Each improvement of the approximation is one iteration of a method such as gradient descent. It has been empirically observed that, in practical situations, when considering the test dataset x_test = (x_{n+1}, x_{n+2}, ..., x_{n+k}), the predictions E( x_{n + i} ) increasingly approximate the values that they are intended to predict as the number of iterations increases, but when the number of iterations is too large, the approximation degenerates and is worse than the result obtained with fewer iterations. This phenomenon is known as overfitting.
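[Editor's note: a minimal sketch of that empirical picture. All data and hyperparameters here are invented for illustration: a cubic model is fit to a noisy linear rule by gradient descent, the test error is recorded after each iteration, and the "number of iterations just before overfitting" is read off as the argmin of the test-error curve.]

```python
# Gradient descent on a cubic model, tracking test error per iteration.
def features(x):                      # cubic feature map [1, x, x^2, x^3]
    return [1.0, x, x * x, x ** 3]

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training labels follow y = 2x plus a little "noise"; test labels are clean.
x_train, y_train = [0.0, 0.5, 1.0, 1.5], [0.1, 0.9, 2.1, 2.9]
x_test, y_test = [0.25, 0.75, 1.25], [0.5, 1.5, 2.5]

w = [0.0, 0.0, 0.0, 0.0]
lr, n_iters = 0.01, 5000
test_errors = []
for _ in range(n_iters):
    # gradient of the training MSE with respect to each weight
    grad = [0.0] * len(w)
    for x, y in zip(x_train, y_train):
        err = predict(w, x) - y
        for j, fj in enumerate(features(x)):
            grad[j] += 2.0 * err * fj / len(x_train)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    test_errors.append(mse(w, x_test, y_test))

# Early stopping: the iteration count just before the test error turns worse.
best_iter = min(range(n_iters), key=lambda t: test_errors[t])
print("stop after iteration", best_iter + 1)
```

In practice this stopping point is estimated from a held-out validation set, since the clean test labels are of course not available during training.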
>
> One possible way to mathematically formalize overfitting, although not the only one, is to suppose that E( x_{n + i} ) is intended to predict D( x_{n + i} ) for all i = 1, 2, ..., k. In this framework, can it be proved that the maximum number of iterations just before overfitting is a non-computable function of the parameters of the problem? I am not excluding that this number could be computable in some particular cases.
>
> Kind regards,
> Jose M.
>
> References:
> [1] Verbeek, Jakob J. "Overfitting using the Minimum Description Length Principle." (2000). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9142&rep=rep1&type=pdf
>
> [2] Pearlmutter, Barak A., and Ronald Rosenfeld. "Chaitin-Kolmogorov complexity and generalization in neural networks." Advances in neural information processing systems (1991): 925-931.
>
> [3] Dong, Xiao, Jiasong Wu, and Ling Zhou. "How deep learning works--The geometry of deep learning." arXiv preprint arXiv:1710.10784 (2017).
>
> [4] Neyshabur, Behnam, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. "Geometry of optimization and implicit regularization in deep learning." arXiv preprint arXiv:1705.03071 (2017).
>