Gradient Descent Intuition

Gradient descent is the default starting point for a lot of machine learning training loops, but the most useful intuition is still geometric: you are repeatedly moving parameters in the direction that most quickly reduces loss in the local neighborhood.

Core idea

For parameters theta and loss L(theta), the update rule is:

theta <- theta - eta * grad(L(theta))

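The gradient points in the direction of steepest local increase, so stepping against it moves you in the direction of steepest local decrease. The learning rate eta controls how far you step each iteration. The guarantee is only local: a step that is too large can overshoot and raise the loss.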
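
As a minimal sketch of the update rule, here is gradient descent on a hypothetical one-dimensional quadratic loss. The loss, its minimum at 3, the starting point, and the value of eta are all assumptions chosen for illustration.

```python
def loss(theta):
    # Hypothetical 1-D quadratic loss with its minimum at theta = 3.
    return (theta - 3.0) ** 2


def grad(theta):
    # Analytic derivative of the loss above.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta, steps):
    # Repeatedly apply theta <- theta - eta * grad(L(theta)).
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


theta_final = gradient_descent(theta=0.0, eta=0.1, steps=100)
print(theta_final, loss(theta_final))
```

With eta = 0.1 each step shrinks the distance to the minimum by a factor of 0.8, so theta ends up very close to 3.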

Why it matters

Three implementation details usually matter more than the slogan:

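The scale of your features changes how stable the updates are: features on very different scales tend to give the loss very different curvature along different directions, so a single learning rate can be too large for some parameters and too small for others.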
A bad learning rate can look like a model problem when it is really just optimization instability.
Stochasticity is not only noise; the randomness in minibatch gradients can also help the optimizer escape sharp regions of the loss surface that tend to be poor solutions.
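
The first two points can be illustrated with a toy quadratic, where each coordinate has its own curvature standing in for a feature scale. This is a sketch under assumed values, not a model of any real training run: the curvatures, learning rates, and the helper name run are hypothetical.

```python
def run(curvatures, eta, steps=200):
    # Minimise L(theta) = 0.5 * sum(c_i * theta_i**2), starting from all ones.
    # Each coordinate is multiplied by (1 - eta * c_i) on every step.
    theta = [1.0 for _ in curvatures]
    for _ in range(steps):
        theta = [t - eta * c * t for t, c in zip(theta, curvatures)]
    # Distance from the minimum at zero, measured by the worst coordinate.
    return max(abs(t) for t in theta)


badly_scaled = (1.0, 100.0)  # one steep direction, one flat direction
well_scaled = (1.0, 1.0)     # same problem after rescaling

# |1 - 0.021 * 100| = 1.1 > 1, so the steep direction diverges.
too_large = run(badly_scaled, eta=0.021)
# Stable, but the flat direction only shrinks by 0.981 per step.
barely_ok = run(badly_scaled, eta=0.019)
# After rescaling, a much larger step is safe and converges quickly.
rescaled = run(well_scaled, eta=0.5)

print(too_large, barely_ok, rescaled)
```

Nothing about the "model" changes between the first two runs; only the learning rate crosses the stability threshold set by the steepest direction. Rescaling removes the conflict between the steep and flat directions.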

What I care about in practice

When I read a paper or reproduce a result, I usually track:

the optimizer
the learning-rate schedule
the batch size
whether warmup is used, and for how long
whether gradients are clipped, and at what threshold

Those choices often explain training behavior better than architecture diagrams.
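
Two of the items tracked above, the learning-rate schedule with warmup and gradient clipping, can be sketched in plain Python. Linear warmup followed by cosine decay and clipping by global norm are common choices that I am assuming here for illustration; the function names and default values are hypothetical and not taken from any particular paper or library.

```python
import math


def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    # Linear warmup up to base_lr, then cosine decay toward zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print(lr_at(0), lr_at(99), lr_at(1000))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Clipping by global norm preserves the direction of the gradient and only shortens the step, which is why it pairs naturally with a learning-rate schedule rather than replacing it.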