Hacker News

Modern Optimizers – An Alchemist's Notes on Deep Learning

38 points by maxall4

by derbOac

Interesting read and interesting links.
The entry asks "why the square root?"
On seeing it, I immediately noticed that with log-likelihood as the loss function, the whitening metric looks a lot like the Jeffreys prior (https://en.wikipedia.org/wiki/Jeffreys_prior), or an approximation of it, which is a reference prior when the CLT holds. The square root can be derived from the reference prior structure, but in a lot of modeling scenarios it also has the effect of scaling things in proportion to the scale of the parameters (for lack of a better way of putting it; think standard error versus sampling variance).
If you think of the optimization method this way, you're essentially reconstructing a kind of Bayesian criterion with a Jeffreys prior.
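To make the square-root point concrete, here is a minimal sketch using a Bernoulli model. It assumes the whitening metric is the expected squared gradient of the log-likelihood (the Fisher information), which may not match the article's definition exactly. The function names are made up for illustration.

```python
import math

def bernoulli_fisher(p):
    """Exact Fisher information of Bernoulli(p): E[(d/dp log f(x; p))^2]."""
    score_if_one = 1.0 / p             # d/dp log(p), the score when x = 1
    score_if_zero = -1.0 / (1.0 - p)   # d/dp log(1 - p), the score when x = 0
    return p * score_if_one ** 2 + (1.0 - p) * score_if_zero ** 2

def jeffreys_unnormalized(p):
    """Jeffreys prior density up to a constant: sqrt of the Fisher information."""
    return math.sqrt(bernoulli_fisher(p))

def standard_error(p, n):
    """Asymptotic standard error of the MLE from n draws: 1 / sqrt(n * I(p))."""
    return 1.0 / math.sqrt(n * bernoulli_fisher(p))

for p in (0.1, 0.5, 0.9):
    print(p, bernoulli_fisher(p), jeffreys_unnormalized(p), standard_error(p, 100))
```

The Fisher information here works out to 1 / (p(1 - p)), which lives on the squared scale of the parameter, like a sampling variance. Its square root is on the same scale as a standard error, which is the sense in which the square root rescales things to the parameter's own units.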