Naftali Tishby: https://www.youtube.com/watch?v=utvIaZ6wYuw

Basic summary: if one traces the mutual information of the different layers of a neural net with respect to (i) the input and (ii) the output, then over the course of training the lower layers have high mutual information with both (but are too complicated to be read out directly as the output), while the higher layers have high mutual information with the output but not so much with the input. This reflects the higher layers' ability to throw away information about the highly entropic input. However, if there is too little data, then the higher layers never gain good information about the output. In fact, it is worse than that (minute 26 of the video): the mutual information with the output decreases with more training for all layers, including the last one. This implies that the generalization error increases with more training. There is a tradeoff between mutual information with the input and mutual information with the output; I think this is related to the Blahut-Arimoto algorithm (a sketch of the corresponding information-bottleneck iteration is below).

Other timestamps:
- 34:40 and following: when the gradients of stochastic gradient descent become noisy and small, we are getting better generalization (his conjecture).
- 49:00 and following: hidden layers avoid critical slowing down.
- 50:00 and following.

Even better talk: https://www.youtube.com/watch?v=gqCztk6hi4Y

Questions: Can one evaluate the mutual information, see that it goes down with more training, and conclude from that that we have too little data? (A binning-style estimator is sketched below.) Can we use this information-theoretic approach to understand why adversarial attacks work, the kind of attack where one modifies a picture of a penguin slightly and it is then misclassified as a stapler? (A toy version of such an attack is also sketched below.) Also: networks with more layers converge more slowly at the beginning and then better at the end.

Anil's question: Why does the testing error keep going down? Not only for neural nets but also for random forests.

Dear Anil,

Let us say that over the course of training one could monitor the mutual information, as well as the error, of the various layers of a neural net with respect to the target. Perhaps one could use the techniques of NICE: Non-linear Independent Components Estimation (https://arxiv.org/pdf/1410.8516.pdf), but let's say there is some method. In the iid case, this could tell us whether more training can help, given the data one has, and whether we can guarantee some error rate. For example, if we find that the error steadily decreases with more training, then, based on Tishby's talk, we have enough data and should keep training until we get good generalization (i.e., we don't overfit). On the other hand, if the error starts to increase, then we don't have enough data and we might be overfitting. A crude decision rule along these lines is sketched at the end of this note.
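To make the monitoring idea concrete: below is a minimal sketch of a binning estimator of the information-plane coordinates (I(T;X), I(T;Y)) for each layer, in the spirit of the discretized-activation estimates behind Tishby's plots. The function names, the choice of 30 equal-width bins, and the identification of X with the sample index are my assumptions, not details from the talk.

```python
import numpy as np

def discrete_mutual_information(t_ids, z_ids):
    """Plug-in estimate of I(T;Z) in nats from two aligned sequences
    of discrete symbols (empirical joint distribution)."""
    n = len(t_ids)
    joint, pt, pz = {}, {}, {}
    for a, b in zip(t_ids, z_ids):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pt[a] = pt.get(a, 0) + 1
        pz[b] = pz.get(b, 0) + 1
    return sum(c / n * np.log(c * n / (pt[a] * pz[b]))
               for (a, b), c in joint.items())

def information_plane_points(layer_activations, labels, n_bins=30):
    """For each layer, bin its activations and return (I(T;X), I(T;Y)).

    layer_activations: list of (n_samples, n_units) arrays, one per layer.
    labels: discrete class labels, one per sample.
    X is identified with the sample index, so I(T;X) is measured against
    the empirical input distribution (an assumption on my part).
    """
    x_ids = list(range(len(labels)))
    points = []
    for act in layer_activations:
        # quantize each unit into n_bins, then hash the binned vector
        # so each distinct activation pattern is one discrete state T
        edges = np.linspace(act.min(), act.max(), n_bins + 1)
        t_ids = [hash(row.tobytes()) for row in np.digitize(act, edges)]
        points.append((discrete_mutual_information(t_ids, x_ids),
                       discrete_mutual_information(t_ids, labels)))
    return points
```

Note the usual caveat for this kind of estimator: with finer bins, every sample gets its own state and I(T;X) saturates at log(n_samples), so the bin width effectively sets the scale of the information plane.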
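On the Blahut-Arimoto connection: the information bottleneck of Tishby, Pereira, and Bialek is computed by a Blahut-Arimoto-style alternation of its self-consistent equations. A minimal sketch, assuming the joint distribution p(x, y) is known and small enough to store as a matrix; the parameter names and the fixed iteration count are mine.

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Blahut-Arimoto-style iteration for the information bottleneck:
    alternate the self-consistent equations for p(t|x), p(t), p(y|t).

    p_xy: joint distribution over (x, y) as an (n_x, n_y) array.
    beta: tradeoff between compressing X and preserving info about Y.
    Returns the soft encoder p(t|x) as an (n_x, n_clusters) array.
    """
    eps = 1e-12
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                       # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)    # conditional p(y|x)
    # random soft initialization of the encoder p(t|x)
    p_t_given_x = rng.random((p_xy.shape[0], n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_t_given_x.T @ p_x                # p(t)
        # p(y|t) = sum_x p(y|x) p(x|t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= (p_t[:, None] + eps)
        # KL(p(y|x) || p(y|t)) for every pair (x, t)
        log_ratio = (np.log(p_y_given_x + eps)[:, None, :]
                     - np.log(p_y_given_t + eps)[None, :, :])
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # self-consistent update: p(t|x) proportional to p(t) exp(-beta KL)
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```

Sweeping beta traces out exactly the tradeoff curve mentioned above: small beta compresses X aggressively (low I(T;X)), large beta preserves more information about Y.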
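On the adversarial-attack question: attacks of the penguin-to-stapler kind are commonly produced by the fast gradient sign method of Goodfellow et al., which is not from Tishby's talk but makes the "slight modification" concrete. A toy sketch against a linear logistic classifier, chosen by me so the input gradient has a closed form:

```python
import numpy as np

def fgsm(x, y, w, b, eps=0.05):
    """Fast-gradient-sign perturbation of input x against a logistic
    classifier p(y=1|x) = sigmoid(w.x + b); y is the true label (0/1).

    For cross-entropy loss L, dL/dx = (p - y) * w, so stepping by
    eps * sign(dL/dx) moves x a small L-infinity step uphill in loss
    while changing each pixel by at most eps.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return x + eps * np.sign((p - y) * w)
```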
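Finally, a crude version of the decision rule from the letter above, applied to a per-epoch history of held-out error; one could equally feed it the per-layer I(T;Y) series from the estimator sketched earlier. The patience window is an arbitrary choice of mine:

```python
def training_diagnosis(val_errors, patience=5):
    """Turn a per-epoch history of held-out error into the verdict
    suggested in the letter: still decreasing -> keep training (enough
    data); rising past its best -> possibly too little data."""
    if len(val_errors) <= patience:
        return "too early to tell"
    best_so_far = min(val_errors[:-patience])
    if min(val_errors[-patience:]) > best_so_far:
        return "error increasing: possibly too little data, risk of overfitting"
    return "error still decreasing: keep training"
```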