Personal Project · Open Source

Gradient Optimizers
Built from Scratch

7 optimizers implemented in pure NumPy — from vanilla Gradient Descent to Adam — benchmarked on convex, ill-conditioned, and non-convex landscapes with full convergence analysis and a PyTorch DL benchmark.

7 Optimizers · 4 Landscapes · 3 LR Schedules · NumPy from Scratch · PyTorch Benchmark

Results at a Glance

Key Numbers

7
Optimizers from scratch
~5×
Speedup from Momentum on κ=100
98.5%
SGD+Nesterov test accuracy (make_moons)
O(1/k²)
Nesterov rate — confirmed empirically

Algorithms

Seven Optimizers

01
Gradient Descent
x = x − lr · ∇f(x). The baseline. Works well on easy problems, zigzags on ill-conditioned ones.
O(1/k)
02
SGD (mini-batch)
Noisy gradient from random mini-batch. Supports cosine, step decay, and warmup schedules.
Stochastic
03
Momentum
Accumulates velocity — dampens oscillations in ravines. ~5× faster than GD on κ=100.
β = 0.9
04
Nesterov AGD
Evaluates the gradient at the look-ahead point. Its O(1/k²) rate is provably optimal for first-order methods on smooth convex problems, and the experiments are consistent with it.
O(1/k²) optimal
05
AdaGrad
Per-parameter adaptive lr using cumulative squared gradients. Stalls on dense gradient problems.
Adaptive
06
RMSProp
Fixes AdaGrad's vanishing lr via exponential moving average. Stays stable throughout training.
EMA · ρ = 0.9
07
Adam
Momentum + RMSProp + bias correction. The most robust across the landscapes tested and a safe default, though it converges slowly on logistic regression (Figure 4).
β₁=0.9 · β₂=0.999
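The seven update rules above can be sketched as small NumPy step functions. This is a minimal sketch, not the project's actual `src.optimizers` code: the function names, the argument order, the velocity convention `v = beta * v - lr * g`, and the `eps` defaults are assumptions made for illustration. The values β = 0.9, ρ = 0.9, β₁ = 0.9, and β₂ = 0.999 are taken from the cards.

```python
import numpy as np

def gd_step(x, grad, lr):
    """Gradient descent: x <- x - lr * grad(x)."""
    return x - lr * grad(x)

def sgd_step(w, X, y, batch_grad, lr, batch_size, rng):
    """Mini-batch SGD: same update, gradient estimated on a random batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return w - lr * batch_grad(w, X[idx], y[idx])

def momentum_step(x, v, grad, lr, beta=0.9):
    """Momentum: accumulate a velocity, then move along it."""
    v = beta * v - lr * grad(x)
    return x + v, v

def nesterov_step(x, v, grad, lr, beta=0.9):
    """Nesterov: like momentum, but the gradient is taken at the look-ahead point."""
    v = beta * v - lr * grad(x + beta * v)
    return x + v, v

def adagrad_step(x, g2_sum, grad, lr, eps=1e-8):
    """AdaGrad: divide by the root of the cumulative squared gradient."""
    g = grad(x)
    g2_sum = g2_sum + g**2
    return x - lr * g / (np.sqrt(g2_sum) + eps), g2_sum

def rmsprop_step(x, g2_avg, grad, lr, rho=0.9, eps=1e-8):
    """RMSProp: replace AdaGrad's running sum with an exponential moving average."""
    g = grad(x)
    g2_avg = rho * g2_avg + (1 - rho) * g**2
    return x - lr * g / (np.sqrt(g2_avg) + eps), g2_avg

def adam_step(x, m, v, t, grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum on the gradient, RMSProp on its square, both bias-corrected.

    t is the 1-based step counter.
    """
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Each function takes the current state and returns the new state, so the caller owns the loop and the history. Starting every accumulator at zero makes the first momentum and Nesterov step identical to a plain GD step.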

Convergence Results

Experiment Outputs

Ill-conditioned convergence
Figure 1. Ill-conditioned quadratic (κ=100) — suboptimality (left) and gradient norm (right). Adam and Nesterov converge in ~50–80 iterations; GD takes 300+ and still lags behind.
Well-conditioned convergence
Figure 2. Well-conditioned quadratic (κ=2). All optimizers converge — differences are small. Nesterov and Adam reach machine precision earliest.
GD vs condition number
Figure 3. GD convergence vs condition number κ ∈ {2, 5, 10, 50, 100, 500}. Solid lines are empirical; dashed lines are the theoretical bound ((κ−1)/(κ+1))^k. κ=500 does not converge in 1000 iterations.
Logistic regression
Figure 4. Logistic regression loss. Momentum, Nesterov, RMSProp reach ~0.10 within 100 iterations. Adam converges slowly here.
Rosenbrock landscape
Figure 5. Rosenbrock landscape (log scale). The narrow banana-shaped valley curving toward the global min at (1, 1).
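The comparison in Figure 3 can be reproduced in spirit with a few lines of NumPy. This is a sketch under stated assumptions, not the project's experiment code: it uses a diagonal quadratic with eigenvalues spread between 1 and κ, the step size 2/(μ + L), and measures the error norm ‖x_k‖/‖x_0‖. Under those assumptions every coordinate contracts by at most (κ−1)/(κ+1) per step. The function name `gd_error_vs_bound` is hypothetical.

```python
import numpy as np

def gd_error_vs_bound(kappa, n=10, n_iter=1000):
    """Run GD on f(x) = 0.5 * sum(lam_i * x_i^2) and return (errors, bound).

    errors[k] is ||x_k|| / ||x_0|| (the minimizer is the origin);
    bound[k] is ((kappa - 1) / (kappa + 1)) ** k.
    """
    lam = np.linspace(1.0, kappa, n)      # eigenvalues: mu = 1, L = kappa
    lr = 2.0 / (lam[0] + lam[-1])         # step size 2 / (mu + L)
    x = np.ones(n)
    x0_norm = np.linalg.norm(x)
    errors = np.empty(n_iter + 1)
    for k in range(n_iter + 1):
        errors[k] = np.linalg.norm(x) / x0_norm
        x = x - lr * lam * x              # the gradient of f is lam * x
    rate = (kappa - 1.0) / (kappa + 1.0)
    bound = rate ** np.arange(n_iter + 1)
    return errors, bound

for kappa in (2, 5, 10, 50, 100, 500):
    errors, bound = gd_error_vs_bound(kappa)
    print(f"kappa={kappa:>3}: error after 1000 iters = {errors[-1]:.2e}, "
          f"bound = {bound[-1]:.2e}")
```

Plotting `errors` and `bound` with `plt.semilogy` for each κ gives the solid and dashed pairs described in the caption.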

Learning Rate & Deep Learning

Schedule & Benchmark Results

LR schedule comparison
Figure 6. SGD with 5 LR schedules on logistic regression. Fixed LR reaches the lowest final loss (~0.155). Cosine and Warmup+Cosine reach ~0.20. The 1/√k schedule decays too fast and stalls at ~0.38.
DL benchmark
Figure 7. MLP training and validation BCE loss on make_moons (200 epochs). Adaptive methods converge faster; SGD variants are smoother.
DL accuracy
Figure 8. Test accuracy per optimizer. SGD, Momentum, Nesterov: 98.5%. Adam, RMSProp, AdaGrad: 97.5%, a one-point gap in favor of the SGD variants on this small dataset.
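The schedules compared in Figure 6 can be sketched as plain functions of the iteration index. This is a minimal sketch with hypothetical names and signatures. The `drop`, `every`, and `warmup` defaults are assumptions, and step decay is included because the SGD card lists it; whether it is the fifth schedule in Figure 6 is also an assumption.

```python
import math

def fixed(k, lr0, n_iter):
    """Constant learning rate."""
    return lr0

def step_decay(k, lr0, n_iter, drop=0.5, every=100):
    """Multiply the rate by `drop` once every `every` iterations."""
    return lr0 * drop ** (k // every)

def cosine(k, lr0, n_iter, lr_min=0.0):
    """Cosine annealing from lr0 at k = 0 down to lr_min at k = n_iter."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * k / n_iter))

def warmup_cosine(k, lr0, n_iter, warmup=50):
    """Linear warmup to lr0 over `warmup` steps, then cosine annealing."""
    if k < warmup:
        return lr0 * (k + 1) / warmup
    return cosine(k - warmup, lr0, n_iter - warmup)

def inv_sqrt(k, lr0, n_iter):
    """The 1/sqrt(k) schedule, shifted so that k = 0 gives lr0."""
    return lr0 / math.sqrt(k + 1)

SCHEDULES = {
    "fixed": fixed,
    "step": step_decay,
    "cosine": cosine,
    "warmup+cosine": warmup_cosine,
    "1/sqrt(k)": inv_sqrt,
}

# Print each schedule's learning rate at a few iterations
for name, schedule in SCHEDULES.items():
    values = [schedule(k, 0.1, 500) for k in (0, 50, 250, 499)]
    print(name, [round(v, 4) for v in values])
```

Keeping every schedule behind the same `(k, lr0, n_iter)` signature lets an SGD loop call `schedule(k, lr0, n_iter)` once per step without knowing which one it was given.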

Takeaways

What the Experiments Showed

📐
Theory matches experiment exactly
GD empirical curves follow the bound ((κ−1)/(κ+1))^k precisely. At κ=500 the bound itself only decays to (499/501)^1000 ≈ 0.018 after 1000 iterations, so theory predicts that GD cannot converge within that budget.
🏎️
Momentum: 5× faster on κ=100
Adding β=0.9 momentum reduces the iteration count from 300+ to ~60 on the ill-conditioned quadratic. One hyperparameter, massive practical effect.
🔭
Nesterov's look-ahead is real
Nesterov consistently outperforms plain momentum on every landscape tested, consistent with its O(1/k²) rate, which is optimal for first-order methods on smooth convex problems.
⚠️
AdaGrad stalls on dense gradients
The cumulative squared gradient denominator grows forever, shrinking the effective lr to near zero within ~100 iterations on dense problems.
📅
Fixed LR beats cosine on logistic
Fixed LR reaches ~0.155 final loss vs ~0.20 for cosine annealing — the constant step size keeps descending while schedules slow down.
🎯
SGD beats Adam on accuracy
SGD, Momentum, and Nesterov all hit 98.5% on make_moons. Adam, RMSProp, and AdaGrad reach 97.5%: the gap between the SGD variants and the adaptive methods shows up even on this small task.
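The momentum speedup on κ=100 can be checked with a short iteration count on a toy quadratic. This is a sketch, not the project's benchmark: the diagonal quadratic, the step size 1/L, the starting point, and the 1e-6 tolerance are all assumptions, so the absolute counts will differ from the 300+ and ~60 quoted above. The function name `iterations_to_tol` is hypothetical.

```python
import numpy as np

def iterations_to_tol(beta, kappa=100.0, n=10, tol=1e-6, max_iter=10_000):
    """Count steps until ||x|| <= tol * ||x0|| on f(x) = 0.5 * sum(lam_i * x_i^2).

    beta = 0 gives plain gradient descent; beta > 0 adds momentum.
    """
    lam = np.linspace(1.0, kappa, n)   # Hessian eigenvalues from 1 to kappa
    lr = 1.0 / lam[-1]                 # step size 1 / L (an assumption)
    x = np.ones(n)
    v = np.zeros(n)
    target = tol * np.linalg.norm(x)
    for k in range(max_iter):
        if np.linalg.norm(x) <= target:
            return k
        v = beta * v - lr * lam * x    # velocity update; the gradient is lam * x
        x = x + v
    return max_iter

gd_iters = iterations_to_tol(beta=0.0)
mom_iters = iterations_to_tol(beta=0.9)
print(f"GD: {gd_iters} iterations, momentum: {mom_iters} iterations, "
      f"ratio: {gd_iters / mom_iters:.1f}x")
```

With this step size, plain GD shrinks the slowest coordinate by a factor of 0.99 per step, while with β = 0.9 every coordinate contracts at roughly √0.9 ≈ 0.95 per step, which is where the gap in iteration counts comes from.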

Code

Quick Start

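import matplotlib.pyplot as plt
import numpy as np

from src.landscapes import make_ill_conditioned_quadratic
from src.optimizers import run_all

# 10-dimensional quadratic with condition number kappa = 100
landscape = make_ill_conditioned_quadratic(n=10, kappa=100)

# Run every optimizer from the same starting point for 300 iterations
results = run_all(landscape, x0=np.zeros(10), n_iter=300)

# Plot each optimizer's f_history on a log-scaled y-axis
for name, r in results.items():
    plt.semilogy(r.f_history, label=name)

plt.legend()
plt.xlabel("Iteration")
plt.show()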
Python 3.10+ · NumPy · PyTorch · Matplotlib · Jupyter