aPaperADay
2 Mishkin Initialization

The problem with a bad initialization in deep networks is that activation and/or gradient magnitudes explode or vanish in the final layers, as noted in the Kaiming paper. Mishkin points out that if each layer scales its input by a factor k, the total scaling after L layers is k^L, where L is the number of layers. Values of k > 1 lead to extremely large values in the output layers, while k < 1 leads to a diminishing signal and gradient.
There are optimal per-layer scaling factors, such as sqrt(2) specifically for ReLU.
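To see the k^L effect concretely, here is a minimal sketch (my own illustration, not from the paper; the width, depth, and Gaussian weights are arbitrary assumptions) that pushes a random input through a stack of ReLU layers with different per-layer gains. A gain of 1 shrinks the signal layer by layer, a gain of 2 blows it up, and sqrt(2) roughly preserves it:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50  # arbitrary choices for illustration

def final_std(gain):
    """Push a random input through `depth` ReLU layers whose weights are
    scaled by gain / sqrt(width), and report the spread of the output."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * gain / np.sqrt(width)
        x = np.maximum(W @ x, 0.0)  # ReLU
    return x.std()

for gain in (1.0, np.sqrt(2.0), 2.0):
    # gain=1.0 -> vanishing signal, gain=2.0 -> exploding, sqrt(2) -> stable
    print(f"gain={gain:.3f}  std of final activations={final_std(gain):.3e}")
```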

Previously proposed methods:

  1. Sussillo & Abbott (2014) Random walk initialization, which keeps the log of the norms of the backpropagated errors constant.
  2. Hinton (2014) knowledge distillation and Romero (2015) Hints initialization.
  3. Srivastava (2015) a gating scheme (like in LSTMs) to control information and gradient flow.
  4. Saxe (2014) orthonormal matrix initialization (see the sketch after this list).
  5. Bengio (2007) layer-wise pre-training.
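Since orthonormal matrix initialization comes up again below, here is a rough sketch of the idea (my own illustration under simple assumptions, not Saxe's exact procedure): draw a random Gaussian matrix, take the Q factor of its QR decomposition, and use that as the weight matrix so it preserves the norm of whatever it multiplies:

```python
import numpy as np

def orthogonal_init(fan_out, fan_in, gain=1.0, rng=None):
    """Return a (fan_out, fan_in) weight matrix with orthonormal rows or
    columns, built from the QR decomposition of a random Gaussian matrix."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, r = np.linalg.qr(a)
    # Fix column signs so the result is uniformly distributed over orthogonal matrices.
    q *= np.sign(np.diag(r))
    if fan_out < fan_in:
        q = q.T
    return gain * q

W = orthogonal_init(128, 256)
# Orthonormal rows: W @ W.T is (approximately) the identity.
print(np.allclose(W @ W.T, np.eye(128), atol=1e-6))
```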

One subject I’m unfamiliar with is the “orthonormal matrix”: I don’t have time (actually brainpower) to get to it today. I’ll have to look at it on Monday.