Given that the variance of each input is gamma and the variance of each weight is sigma, then for a linear unit y = sum_i w_i x_i with n inputs (assuming zero-mean, independent inputs and weights), the output has mean E[y] = 0 and variance Var(y) = n * sigma * gamma.
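A minimal Monte Carlo sketch checking this, assuming zero-mean Gaussian inputs and weights; the values of n, gamma, and sigma are hypothetical:

```python
import numpy as np

# Check Var(y) = n * sigma * gamma for y = sum_i w_i * x_i,
# assuming zero-mean, independent inputs and weights.
rng = np.random.default_rng(0)
n, gamma, sigma = 256, 1.0, 0.01                          # fan-in, input var, weight var
x = rng.normal(0.0, np.sqrt(gamma), size=(100_000, n))    # inputs with variance gamma
w = rng.normal(0.0, np.sqrt(sigma), size=n)               # weights with variance sigma
y = x @ w                                                 # one linear unit per sample

print(f"empirical mean    : {y.mean():.4f}  (expected 0)")
print(f"empirical variance: {y.var():.4f}  (expected n*sigma*gamma = {n * sigma * gamma:.4f})")
```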
Since we need unit variance in both the forward pass (n_in * sigma = 1) and the backward pass (n_out * sigma = 1), and both cannot hold at once unless n_in = n_out, we compromise by satisfying (n_in + n_out) / 2 * sigma = 1, i.e. sigma = 2 / (n_in + n_out) (Xavier/Glorot initialization).
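A minimal sketch of initializing a weight matrix with this compromise variance; the layer sizes are hypothetical:

```python
import numpy as np

# Xavier/Glorot initialization: weight variance sigma = 2 / (n_in + n_out),
# the compromise between the forward condition n_in * sigma = 1
# and the backward condition n_out * sigma = 1.
def xavier_init(n_in, n_out, rng=np.random.default_rng(0)):
    sigma = 2.0 / (n_in + n_out)                       # compromise weight variance
    return rng.normal(0.0, np.sqrt(sigma), size=(n_in, n_out))

W = xavier_init(512, 256)
print(W.var())   # ~ 2 / (512 + 256) = 0.0026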
Penalize large weight vectors by adding a weight penalty to the overall loss, e.g. the squared L2 norm of the weights scaled by a coefficient lambda: L_total = L_data + (lambda / 2) * ||w||^2 (weight decay / L2 regularization).
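A sketch of the penalty and its gradient contribution, assuming the L2 form above; `data_loss`, `grad`, and `lam` are hypothetical placeholders:

```python
import numpy as np

# L2 weight decay: (lambda/2) * ||w||^2 added to the loss
# contributes lambda * w to the gradient.
def total_loss(data_loss, w, lam=1e-4):
    return data_loss + 0.5 * lam * np.sum(w ** 2)

def total_grad(grad, w, lam=1e-4):
    return grad + lam * w    # decay term shrinks large weights each step
```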
1) Use "batch mean" and "batch variance" to normalize each layer. 2) Only use moving mean and moving variance while doing inference. 3) Update moving mean and variance with batch mean and variance (weighted by momentum) 4) Update gamma and beta through back propagation.