📐
Theory matches experiment exactly
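GD empirical curves track the worst-case bound ((κ−1)/(κ+1))^k closely. At κ=500 the bound shrinks the error by only about e^(−4) ≈ 0.018 over 1000 iterations, so GD cannot reach a tight tolerance in that budget, exactly as theory predicts.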
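As a rough check of this bound, the sketch below runs gradient descent on a two-dimensional quadratic and compares the measured error ratio with ((κ−1)/(κ+1))^k. The test function, the starting point (1, 1), the step size 2/(L+μ), and the choice to measure error as distance to the minimizer are all assumptions made for illustration, not necessarily the setup behind the curves above.

```python
import math

def gd_quadratic(kappa, iters, x0=(1.0, 1.0)):
    """Run GD on f(x) = 0.5*(x1^2 + kappa*x2^2); return ||x_k|| for k = 0..iters."""
    mu, L = 1.0, float(kappa)          # smallest and largest curvature
    step = 2.0 / (L + mu)              # assumed step size: the classical optimum for quadratics
    x = list(x0)
    norms = [math.hypot(*x)]           # the minimizer is the origin, so ||x|| is the error
    for _ in range(iters):
        grad = [mu * x[0], L * x[1]]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        norms.append(math.hypot(*x))
    return norms

kappa, iters = 500, 1000
norms = gd_quadratic(kappa, iters)
rho = (kappa - 1) / (kappa + 1)        # per-iteration contraction factor from the bound
measured = norms[-1] / norms[0]
predicted = rho ** iters
print(f"measured {measured:.4f}, bound {predicted:.4f}")
```

On this particular quadratic both coordinates contract by exactly (κ−1)/(κ+1) in magnitude per step, so the measured ratio sits on the bound rather than merely below it.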
🏎️
Momentum: 5× faster on κ=100
Adding β=0.9 momentum reduces iteration count from 300+ to ~60 on the ill-conditioned quadratic. One hyperparameter, massive practical effect.
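A minimal sketch of how such an iteration count could be measured is shown here. It uses heavy-ball momentum on an assumed two-dimensional quadratic with κ=100, an assumed learning rate of 1/L = 0.01, and an assumed stopping rule of ‖x_k‖ ≤ 10⁻³‖x_0‖, so the counts it produces need not match the 300+ and ~60 figures quoted above. The helper name `iterations_to_tol` is hypothetical.

```python
import math

def iterations_to_tol(kappa, lr, beta, tol=1e-3, max_iters=10_000):
    """Count heavy-ball iterations on f(x) = 0.5*(x1^2 + kappa*x2^2)
    until ||x_k|| <= tol * ||x_0||. beta = 0 recovers plain gradient descent."""
    h = (1.0, float(kappa))            # diagonal Hessian entries
    x = [1.0, 1.0]                     # assumed starting point
    v = [0.0, 0.0]                     # velocity starts at rest
    target = tol * math.hypot(*x)
    for k in range(1, max_iters + 1):
        grad = [hi * xi for hi, xi in zip(h, x)]
        v = [beta * vi - lr * gi for vi, gi in zip(v, grad)]
        x = [xi + vi for xi, vi in zip(x, v)]
        if math.hypot(*x) <= target:
            return k
    return None                        # did not reach the tolerance

gd_iters = iterations_to_tol(kappa=100, lr=0.01, beta=0.0)
momentum_iters = iterations_to_tol(kappa=100, lr=0.01, beta=0.9)
print(f"GD: {gd_iters} iterations, momentum: {momentum_iters} iterations")
```

The only difference between the two runs is `beta`, which mirrors the "one hyperparameter" point: the velocity term lets the slow, low-curvature direction build up speed instead of shrinking by a fixed factor each step.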
🔭
Nesterov's look-ahead is real
Nesterov consistently outperforms plain momentum on every landscape tested, in line with its O(1/k²) rate, which is optimal for first-order methods on smooth convex problems.
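The look-ahead can be sketched as a single update step: the gradient is evaluated at the point the current velocity is about to carry the iterate to, not at the iterate itself. The function names, the quadratic test problem, and the values lr=0.01 and β=0.9 below are assumptions for illustration only.

```python
import math

def nesterov_step(x, v, grad_fn, lr, beta):
    """One Nesterov momentum update; grad_fn maps a point to its gradient."""
    lookahead = [xi + beta * vi for xi, vi in zip(x, v)]   # peek ahead along the velocity
    g = grad_fn(lookahead)                                 # gradient at the look-ahead point
    v_new = [beta * vi - lr * gi for vi, gi in zip(v, g)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new

def quad_grad(x, kappa=100.0):
    """Gradient of the assumed test function f(x) = 0.5*(x1^2 + kappa*x2^2)."""
    return [x[0], kappa * x[1]]

x, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    x, v = nesterov_step(x, v, quad_grad, lr=0.01, beta=0.9)
final_norm = math.hypot(*x)
print(f"distance to minimizer after 500 steps: {final_norm:.2e}")
```

Replacing `lookahead` with `x` in the gradient call turns this into plain heavy-ball momentum, which makes the two methods easy to compare on the same problem.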
⚠️
AdaGrad stalls on dense gradients
The cumulative squared-gradient denominator grows without bound, shrinking the effective learning rate to near zero within ~100 iterations on dense problems.
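The shrinkage follows directly from the AdaGrad update, where the effective step for a coordinate is lr / (sqrt(sum of squared gradients) + eps). The sketch below tracks that quantity for one coordinate; the constant gradient of 1.0, the base rate 0.1, and the helper name `adagrad_effective_lrs` are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate AdaGrad applies at each step
    for a single coordinate, given that coordinate's gradient history."""
    accum = 0.0                        # running sum of squared gradients (never decays)
    rates = []
    for g in grads:
        accum += g * g
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates

rates = adagrad_effective_lrs([1.0] * 100)   # a dense signal: nonzero gradient every step
print(f"step 1: {rates[0]:.4f}, step 100: {rates[-1]:.4f}")
```

With a gradient of constant magnitude g, the effective rate after k steps is roughly lr / (g·sqrt(k)), so larger or denser gradients drive it toward zero faster. RMSProp and Adam avoid this by replacing the running sum with an exponential moving average.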
📅
Fixed LR beats cosine on logistic
Fixed LR reaches ~0.155 final loss vs ~0.20 for cosine annealing: the constant step size keeps descending while the decaying schedule shrinks its steps and slows progress.
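For reference, a common form of the cosine schedule is sketched below; it shows why late iterations take much smaller steps than a fixed rate would. The function name `cosine_lr`, the peak rate 0.1, the 100-step horizon, and the absence of warm-up or restarts are assumptions, and the schedule used in the experiment may differ.

```python
import math

def cosine_lr(t, total, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate at step t of `total` (no warm-up, no restarts)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / total))

total = 100
schedule = [cosine_lr(t, total, lr_max=0.1) for t in range(total + 1)]
# Share of the total step budget (sum of learning rates) relative to a fixed rate of 0.1.
budget_ratio = sum(schedule) / (0.1 * (total + 1))
print(f"start {schedule[0]:.3f}, middle {schedule[total // 2]:.3f}, end {schedule[-1]:.3f}")
print(f"cumulative step size vs fixed LR: {budget_ratio:.2f}")
```

Averaged over the run, this schedule spends about half the cumulative step size of a fixed rate at the same peak. On a problem still far from its minimum when training stops, that is consistent with the fixed rate ending at a lower loss.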
🎯
SGD beats Adam on accuracy
SGD, Momentum, and Nesterov all hit 98.5% on make_moons, while Adam, RMSProp, and AdaGrad reach 97.5%. The Adam generalization gap shows up even on small tasks.