With Gradient Descent we want to find the weights that minimize the error, i.e. the cost function, by iteratively searching for its minimum. There are several methods; we will look at the main ones: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent.
The first, Batch Gradient Descent, is a deterministic method: starting from the same data it always produces the same outcome. It calculates the cost function over all the input data and then updates the weights through backpropagation. This process is expensive in time and resources, because all the data must be loaded in memory to compute each update, especially as the dataset gets big. Stochastic Gradient Descent, abbreviated SGD, is a stochastic method, because the outcome is not always the same. SGD calculates the cost function for a single input sample and then updates the weights, repeating this for every sample. This method is faster because it needs far fewer resources, and it is the most used when we have a lot of input data.
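As an illustration (this sketch is not from the article; the toy linear model, data and learning rate are made up for the example), the two update schemes can be contrasted like this:

```python
import numpy as np

# Toy data: 100 noisy samples of y = 3x + 2 (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 2 + rng.normal(0, 0.1, size=100)

def gradients(w, b, x, t):
    """Gradients of the squared error 0.5 * (w*x + b - t)**2 w.r.t. w and b."""
    err = w * x + b - t
    return err * x, err

lr = 0.1

# Batch Gradient Descent: ONE weight update per epoch, over ALL samples.
w, b = 0.0, 0.0
for epoch in range(50):
    gw, gb = gradients(w, b, X, y)   # vectorized over the whole dataset
    w -= lr * gw.mean()              # average gradient of the full batch
    b -= lr * gb.mean()

# Stochastic Gradient Descent: one weight update PER SAMPLE, shuffled order.
w, b = 0.0, 0.0
for epoch in range(50):
    for i in rng.permutation(len(X)):
        gw, gb = gradients(w, b, X[i], y[i])
        w -= lr * gw
        b -= lr * gb
```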
Mini-Batch Gradient Descent is more recent and is a balance between the first two methods: we compute the gradient for a "small" set of points (a typical mini-batch size is 16 or 32), then backpropagate and update the weights.
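A sketch of the mini-batch scheme, reusing `X`, `y`, `gradients()`, `lr` and `rng` from the toy example above:

```python
# Mini-Batch Gradient Descent: one weight update per mini-batch of 32 samples.
batch_size = 32
w, b = 0.0, 0.0
for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        gw, gb = gradients(w, b, X[batch], y[batch])
        w -= lr * gw.mean()                      # average over the mini-batch
        b -= lr * gb.mean()
```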
An Epoch refers to a single pass through all of the training data:
- in Batch Gradient Descent there is 1 step per epoch
- in Stochastic Gradient Descent there are n steps per epoch, where n is the training set size
- in Mini-Batch Gradient Descent there are n steps per epoch, where n equals the training set size divided by the batch size (e.g. 16 or 32)
The number of epochs is another important parameter for an Artificial Neural Network.
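To make the step counts concrete, a small computation with a hypothetical training set size:

```python
import math

n = 50_000        # hypothetical training set size
batch_size = 32

steps_batch_gd  = 1                            # whole set -> 1 step per epoch
steps_sgd       = n                            # one step per sample -> 50000
steps_minibatch = math.ceil(n / batch_size)    # 50000 / 32 -> 1563 steps
```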
To prevent overfitting and improve training there are several techniques: one of these is Dropout, while among the optimizers the most popular are Adam and RMSProp. An optimizer is a variant of the weight update rule that gives better performance of the Neural Network. RMSProp is an adaptive learning rate method, a variant of the Adagrad method for updating the weights: it modulates the learning rate of each weight based on the magnitude of its gradients, equalizing their effect.
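A minimal sketch of the standard RMSProp update rule (the default hyperparameter values here are common choices, not taken from the article):

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp step: each weight's learning rate is scaled by a running
    average of its squared gradients, so weights with large gradients take
    smaller effective steps (the equalizing effect described above)."""
    cache = decay * cache + (1 - decay) * grad ** 2   # running avg of grad^2
    w = w - lr * grad / (np.sqrt(cache) + eps)        # per-weight scaled step
    return w, cache
```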
Adam is a newer method and the most widely used; it is a variant of RMSProp. With Dropout we avoid overfitting by randomly selecting nodes and setting their output to 0 during training, which reduces the number of active nodes: the hidden layers effectively use fewer units than they contain, according to the dropout rate passed to the Neural Network.
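A minimal Keras sketch of how this looks in practice (the layer sizes, the 0.5 dropout rate and the use of TensorFlow/Keras are illustrative assumptions, not from the article):

```python
from tensorflow import keras

# Dropout(0.5) randomly sets half of the previous layer's outputs to 0
# during training; Adam is selected as the optimizer at compile time.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")   # or optimizer="rmsprop"
```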
The image above shows the search for the minimum of the cost function, in two- and three-dimensional space.
The formula to calculate the cost function for a single input sample and for all the input data:
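Assuming the usual squared-error cost (the text above does not pin down a specific loss), the per-sample and aggregate forms are:

$$E_i = \frac{1}{2}\bigl(y_i - \hat{y}_i\bigr)^2, \qquad E = \frac{1}{n}\sum_{i=1}^{n} E_i$$

where $y_i$ is the target for input sample $i$, $\hat{y}_i$ is the network's prediction, and $n$ is the training set size.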