Training with dropout is very similar to training a standard neural network. The difference is that each minibatch effectively trains a new thinned network, obtained by dropping out units. Forward and back propagation are carried out on this thinned network only, and because a different thinned network is sampled for every minibatch, the updates are effectively averaged over a huge number of unique unit combinations. Other techniques that improve learning (momentum, L2 weight decay) are found to remain helpful in combination with dropout.
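To make the procedure concrete, here is a minimal NumPy sketch of one dropout layer during training. The function name `dropout_forward` and the keep probability `p_keep` are my own illustrative choices, not from the paper; it follows the original convention of scaling activations by the keep probability at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5, train=True):
    """Apply dropout to a batch of activations h.

    During training, each unit is kept independently with probability
    p_keep (a fresh mask per minibatch samples a new thinned network).
    At test time, activations are scaled by p_keep so their expected
    value matches training, as in the original paper."""
    if train:
        mask = (rng.random(h.shape) < p_keep).astype(h.dtype)
        return h * mask, mask
    return h * p_keep, None

# One minibatch of activations (4 examples, 3 hidden units).
h = np.ones((4, 3))
dropped, mask = dropout_forward(h, p_keep=0.8)

# Backprop flows only through the kept units: reuse the same mask.
grad_out = np.ones_like(h)
grad_h = grad_out * mask
```

Because each call samples a fresh mask, running this across many minibatches trains a different thinned sub-network each time, which is what the averaging argument above refers to.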

One particular form of regularization found to be helpful with dropout is max-norm: constraining the norm of the incoming weight vector at each hidden unit by a constant *c*, which can be tuned as a hyperparameter. The max-norm constraint is $\lVert \bold{w} \rVert_2 \le c$. Without a full mathematical proof of why this is especially helpful, one possible justification is that combining a high learning rate with dropout and the max-norm constraint lets the model train aggressively without the weights blowing up.
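A sketch of how the constraint is typically applied after each weight update, assuming each column of `W` holds the incoming weights of one hidden unit (the function name and layout are my own; the paper just states the projection $\lVert \bold{w} \rVert_2 \le c$):

```python
import numpy as np

def max_norm_constrain(W, c=3.0):
    """Project each column of W (the incoming weight vector of one
    hidden unit) back onto the L2 ball of radius c, so that
    ||w||_2 <= c holds after every update."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Scale down only the columns whose norm exceeds c.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Column norms here are 5.0 and ~0.224.
W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
W_c = max_norm_constrain(W, c=3.0)
# First column is rescaled to norm 3; second is left unchanged.
```

Calling this after each gradient step keeps the weights bounded, which is what permits the high learning rates mentioned above.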

One thing I found very interesting is the interaction between newly proposed techniques. Typically, when authors propose one beneficial technique, they also find other techniques that increase the benefit even further and propose them together. Here the authors show that max-norm regularization, though not new itself, can add even more benefit in combination with dropout: the two complement the learning process synergistically by bounding the effective learning rate and spreading learning across the entire network's weight matrices.

This is the day I got Katex working, so I was playing around below:

test inline $a = b + c$

$E = m * c^2$

$y = x^2 + 0.3542$

test inline $\hat{y} = \bold{W}x + b$

$\left(\beta m c^2 + c \left(\sum_{n=1}^3\alpha_n p_n\right)\right) \psi(x,t) = i\hbar \frac{\partial \psi(x,t)}{\partial t}$