non-computability of overfitting

Sam Sanders sasander at me.com
Mon Jan 24 06:16:48 EST 2022


Dear FOM,

I believe the following would be interesting/useful here:

On one hand, the spectral gap problem from quantum mechanics is undecidable (see https://arxiv.org/abs/1502.04573).

On the other hand, when I talked to the first author of that paper some time ago, he essentially told me the following:

“To show undecidability, we derived the Halting problem from a solution to the spectral gap theorem.
However, we showed this solution to a number of physicists, who all said that this kind of solution
would never be considered in physics: only much nicer objects are generally considered/needed/used”

The question is thus whether the below ML result is of the same kind, or whether 

“ML problems that appear in practice”

are undecidable.

Best,

Sam

PS: I realise the quoted class is somewhat vague.  




> On 21 Jan 2022, at 17:07, José Manuel Rodríguez Caballero <josephcmac at gmail.com> wrote:
> 
> Dear FOM members,
> In the field of (supervised) machine learning, we consider two finite sequences x_train = (x_1, ..., x_n) (features) and y_train = (y_1, ..., y_n) (labels). We are interested in inferring the rule that yields y_i from x_i. A possible way [1] to formalize the notion of learning in the framework of Kolmogorov complexity [2] is to identify "learning" with the smallest algorithm D such that D(x_i) = y_i for all i = 1, 2, ..., n. This algorithm is called the decompressor of x_train and y_train. Then, given another sequence x_test = (x_{n+1}, x_{n+2}, ..., x_{n+k}), the predictions of this learning process are the sequence y_test = ( D(x_{n+1}), D(x_{n+2}), ..., D(x_{n+k}) ). The practical problem is that the function associating x_train and y_train to D is non-computable.
> 
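[The "smallest decompressor" idea can be illustrated with a bounded toy search. The tiny instruction set and all names below are hypothetical; the point is that a length-ordered search over a *bounded* program space is computable, whereas the unbounded Kolmogorov-style search over all programs is not, because of the halting problem.]

```python
from itertools import product

# Toy "programs" are sequences of primitive steps from a tiny, hypothetical
# instruction set. A real Kolmogorov-complexity decompressor would range over
# all programs of a universal machine, which makes the search non-computable;
# bounding the program length sidesteps that (and loses generality).
OPS = {
    "inc": lambda v: v + 1,   # add one
    "dbl": lambda v: v * 2,   # double
    "sqr": lambda v: v * v,   # square
}

def run(program, x):
    """Apply the program's steps to x in order."""
    for op in program:
        x = OPS[op](x)
    return x

def smallest_decompressor(x_train, y_train, max_len=4):
    """Return the shortest program D with D(x_i) == y_i, searching by length."""
    for length in range(max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x) == y for x, y in zip(x_train, y_train)):
                return program
    return None  # no program of length <= max_len fits the data

# The rule y = 2*(x+1) is recovered as the two-step program ("inc", "dbl").
D = smallest_decompressor([1, 2, 3], [4, 6, 8])
print(D)
print([run(D, x) for x in [4, 5]])  # predictions on a "test" sequence
```

The length-ordered enumeration is what makes the returned program the *smallest* fitting one within the bound.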
> What machine learning engineers usually do is consider a space [3, 4] containing the algorithms E such that E(x_i) approximates y_i for all i = 1, 2, ..., n, and smoothly move from one such algorithm (a point) to a neighboring algorithm (point) to improve the approximation. The improvements of the approximation are given by the iterations of a method, e.g., gradient descent. It has been empirically observed that, in practical situations, when considering the dataset for testing x_test = (x_{n+1}, x_{n+2}, ..., x_{n+k}), the predictions E( x_{n + i} ) increasingly approximate the values they are intended to predict as the number of iterations increases, but when the number of iterations is too large, the approximation degenerates and is worse than the result obtained with fewer iterations. This phenomenon is known as overfitting.
> 
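[A minimal sketch of the iterations-vs-overfitting picture: plain gradient descent on a degree-9 polynomial fit to noisy linear data, tracking training loss and held-out loss per iteration. The dataset, fixed "noise" values, learning rate, and the early-stopping rule are illustrative assumptions, not part of the original post.]

```python
# Fit a degree-9 polynomial to noisy samples of the line y = x by gradient
# descent on mean-squared error, recording train and held-out loss each step.
DEGREE, LR, ITERS = 9, 0.05, 2000

def features(x):
    return [x ** k for k in range(DEGREE + 1)]

def predict(w, x):
    return sum(wk * fk for wk, fk in zip(w, features(x)))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training labels: the line y = x plus fixed, deterministic "noise".
noise = [0.2, -0.3, 0.1, -0.1, 0.3, -0.2, 0.1, 0.2, -0.3, 0.1]
x_train = [i / 10 for i in range(10)]
y_train = [x + e for x, e in zip(x_train, noise)]
# Held-out points drawn from the clean underlying line.
x_test = [0.05, 0.35, 0.65, 0.95]
y_test = list(x_test)

w = [0.0] * (DEGREE + 1)
train_hist, test_hist = [], []
for step in range(ITERS):
    # Gradient of the mean-squared error with respect to each weight.
    grads = [0.0] * (DEGREE + 1)
    for x, y in zip(x_train, y_train):
        err = predict(w, x) - y
        for k, fk in enumerate(features(x)):
            grads[k] += 2 * err * fk / len(x_train)
    w = [wk - LR * g for wk, g in zip(w, grads)]
    train_hist.append(mse(w, x_train, y_train))
    test_hist.append(mse(w, x_test, y_test))

# "Early stopping": the iteration count minimizing held-out loss plays the
# role of the "maximum number of iterations just before overfitting".
best_iter = min(range(ITERS), key=test_hist.__getitem__)
print("final train loss:", train_hist[-1])
print("held-out loss minimized at iteration", best_iter)
```

In practice this number is estimated empirically from a held-out set, which is exactly why the question of whether it is computable from the problem's parameters alone is interesting.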
> A possible way to mathematically formalize overfitting, although not the only one, is to suppose that E( x_{n + i} ) is intended to predict D( x_{n + i} ) for all i = 1, 2, ..., k. In this framework, can it be proved that the maximum number of iterations just before overfitting sets in is a non-computable function of the parameters of the problem? I do not exclude that this number could be computable in some particular cases.
> 
> Kind regards,
> Jose M.
> 
> References:
> [1] Verbeek, Jakob J. "Overfitting using the Minimum Description Length Principle." (2000). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9142&rep=rep1&type=pdf
> 
> [2] Pearlmutter, Barak A., and Ronald Rosenfeld. "Chaitin-Kolmogorov complexity and generalization in neural networks." Advances in neural information processing systems (1991): 925-931.
> 
> [3] Dong, Xiao, Jiasong Wu, and Ling Zhou. "How deep learning works--The geometry of deep learning." arXiv preprint arXiv:1710.10784 (2017).
> 
> [4] Neyshabur, Behnam, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. "Geometry of optimization and implicit regularization in deep learning." arXiv preprint arXiv:1705.03071 (2017).
> 
