Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus

Dept. of Computer Science, Courant Institute of Mathematical Sciences, New York University

Figure: schematic comparison of a No-Drop Network, a DropOut Network, and a DropConnect Network.

In Dropout, each element of a layer's output is kept with probability \(p\) and set to \(0\) with probability \(1-p\). If we further assume an activation function with \(a(0)=0\), such as \(tanh\) or \(relu\) (\(\star\) is element-wise multiplication): \[ r = m \star a\left( Wv \right) = a\left( m\star Wv \right) \]

Training a Network with DropConnect:

DropConnect is a generalization of Dropout in which each connection, rather than each output unit, is dropped with probability \(1-p\): \[ r= a\left( \left( M\star W\right) v\right) \] where \(M\) is the weight mask, \(W\) the fully-connected layer weights, and \(v\) the fully-connected layer inputs.
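As a minimal NumPy sketch of the two feed-forward rules above (the layer sizes, seed, and keep probability here are arbitrary illustrations, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

p = 0.5                                       # keep probability (illustrative)
W = rng.standard_normal((4, 3))               # fully-connected layer weights
v = rng.standard_normal(3)                    # layer input

# Dropout: one mask entry per output unit; because relu(0) = 0,
# masking after the activation equals masking before it
m = (rng.random(4) < p).astype(float)
assert np.allclose(m * relu(W @ v), relu(m * (W @ v)))

# DropConnect: one mask entry per connection (weight)
M = (rng.random(W.shape) < p).astype(float)
r = relu((M * W) @ v)
```

The only structural difference is where the Bernoulli mask is applied: on the output vector for Dropout, on the weight matrix for DropConnect.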

Dropout Network Inference (mean-inference): \( \mathbf{E}_M\left[ a\left( \left(M\star W\right)v\right) \right]\approx a\left(\mathbf{E}_M\left[ \left(M\star W\right) v \right]\right) = a\left(pWv\right) \)

DropConnect Network Inference (sampling): \(\mathbf{E}_M\left[ a\left( \left(M\star W\right)v\right) \right]\approx \mathbf{E}_u\left[a(u)\right] \) where \( u\sim \mathcal{N}\left( pWv,\; p\left(1-p\right)\left(W\star W\right)\left(v\star v\right)\right) \), i.e. each neuron activation is approximated by a Gaussian distribution via moment matching.
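A NumPy sketch of the moment-matched inference above, checked against a brute-force average over sampled masks (sizes, seed, and sample counts are illustrative assumptions; a wide input is used so the Gaussian approximation is accurate):

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

p = 0.5
W = rng.standard_normal((4, 100))   # 100 inputs: many summed terms -> near-Gaussian u
v = rng.standard_normal(100)

# Moment matching: (M * W) @ v has mean p*W@v and variance p(1-p)(W*W)@(v*v)
mu = p * (W @ v)
var = p * (1 - p) * ((W * W) @ (v * v))

r_mean_inf = relu(mu)               # Dropout-style mean-inference a(pWv), for comparison

# DropConnect inference: estimate E_u[a(u)] from Gaussian samples u ~ N(mu, var)
u = mu + np.sqrt(var) * rng.standard_normal((20000, mu.size))
r_gauss = relu(u).mean(axis=0)

# Brute-force reference: average the activation over many sampled DropConnect masks
r_exact = np.mean(
    [relu(((rng.random(W.shape) < p) * W) @ v) for _ in range(4000)], axis=0)
assert np.allclose(r_gauss, r_exact, atol=0.5)
```

Sampling from the Gaussian before the nonlinearity avoids the bias that mean-inference \(a(pWv)\) incurs when \(a\) is applied to the averaged pre-activation.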

Experiments on the MNIST dataset using a 2-layer fully-connected neural network:

Figure: (a) preventing overfitting as the size of the fully-connected layers increases; (b) varying the drop rate in a 400-400 network; (c) convergence properties of the train/test sets.

Evaluating the DropConnect model as a regularizer for deep neural networks on several popular image classification datasets (error rate, %):

| Dataset | DropConnect | Dropout | Previous best result (2013) |
|---|---|---|---|
| MNIST | 0.21 | 0.27 | 0.23 |
| CIFAR-10 | 9.32 | 9.83 | 9.55 |
| SVHN | 1.94 | 1.96 | 2.80 |
| NORB-full-2fold | 3.23 | 3.03 | 3.36 |

Performance comparison between different implementations of the DropConnect layer on an NVidia GTX 580 GPU, relative to a 2.67GHz Intel Xeon (compiled with the -O3 flag). Input and output dimensions are 1024 and the mini-batch size is 128 (you may not get exactly the same numbers with my code on your machine):

| Implementation | Mask Weight | Total Time (ms) | Speedup |
|---|---|---|---|
| CPU | float | 3401.6 | 1.0 X |
| CPU | bit | 1831.1 | 1.9 X |
| GPU | float (global memory) | 35.0 | 97.2 X |
| GPU | float (tex1D memory) | 27.2 | 126.0 X |
| GPU | bit (tex2D memory) | 8.2 | 414.8 X |

Thus, an efficient implementation should: 1) encode the connection information in bits, and 2) bind aligned 2D memory to a 2D texture for fast queries of connection status. The texture-memory cache hit rate of our implementation is close to \(90\%\).
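The bit-encoding idea can be illustrated in NumPy (the actual implementation is CUDA; the sizes and indices here are arbitrary). A 1024x1024 float mask occupies 4 MB, while one bit per connection needs only 128 KB, which caches far better:

```python
import numpy as np

rng = np.random.default_rng(3)

# Boolean mask: one entry per connection of a 1024x1024 layer
mask = rng.random((1024, 1024)) < 0.5

# Pack each row into bits: shape (1024, 128), one bit per connection
packed = np.packbits(mask, axis=1)

def connected(i, j):
    """Query the connection status of weight (i, j) from the packed bits.

    np.packbits is MSB-first, so column j lives in byte j // 8 at bit 7 - j % 8.
    """
    return bool((packed[i, j // 8] >> (7 - j % 8)) & 1)

assert all(connected(i, j) == mask[i, j]
           for i, j in [(0, 0), (5, 300), (1023, 1023)])
```

On the GPU the packed rows are the aligned 2D array bound to a 2D texture, so each connection-status query is a cached texture fetch rather than a global-memory read.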

Effect of the keep probability \(p\) on model complexity:

- \(p=0\): the model complexity is zero, since the input has no influence on the output.
- \(p=1\): the model reduces to the complexity of a standard model.
- \(p=1/2\): all sub-models have equal preference.

CUDA code (Sep-20-2013 code update; changelog)

Zygmunt from FastML has successfully reproduced the CIFAR-10 experimental results on the Kaggle CIFAR-10 leaderboard in his article Regularizing neural networks with dropout and with DropConnect.

A summary of questions and my answers about working with my (uncleaned) code is here.