Backpropagation with Gradient Descent
Introduction

Backpropagation is the backward propagation of errors and is a powerful tool of deep learning. Combined with Gradient Descent, backpropagation reduces the cost function and the execution time. We now discuss how Gradient Descent is calculated.
Gradient Descent
With Gradient Descent we want to find the weights that minimize the error, i.e. the cost function, through a number of iterations that search for its minimum. There are several methods; we look at the main ones: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. The first, Batch Gradient Descent, is a deterministic method: starting from the same data it always produces the same outcome. It calculates the cost function over all the input data and then updates the weights through Backpropagation. This process is very expensive in time and resources, because all the data must be loaded in memory to find the best cost function, and it becomes more so as the data gets big. Stochastic Gradient Descent, shortened SGD, is a stochastic method because the outcome is not always the same. With SGD we calculate the cost function of one input sample and then update the weights, and we repeat this for every input sample. This method is faster because it does not need many resources, and it is the most used when we have a lot of input data.
Mini-Batch Gradient Descent is more recent and is a balance between the first two methods. We compute the derivative for a "small" set of points (a typical mini-batch size is 16 or 32), then update the weights through Backpropagation.
An epoch refers to a single pass through all of the training data:
- in Batch Gradient Descent there is 1 step per epoch
- in Stochastic Gradient Descent there are n steps per epoch, where n is the training set size
- in Mini-Batch Gradient Descent there are n steps per epoch, where n is equal to the training set size divided by the batch size (16 or 32)
Naturally, the number of epochs is another important parameter of an Artificial Neural Network.
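For example, as a minimal sketch in Python (the training set size of 1000 samples is an assumption for illustration only), the number of steps per epoch for each variant is:

import math

training_set_size = 1000   # assumed size, for illustration only
batch_size = 32            # a typical mini-batch size (16 or 32)

steps_batch = 1                                                # Batch Gradient Descent
steps_sgd = training_set_size                                  # Stochastic Gradient Descent
steps_mini_batch = math.ceil(training_set_size / batch_size)   # Mini-Batch Gradient Descent

print(steps_batch, steps_sgd, steps_mini_batch)                # prints: 1 1000 32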
[Figure: the epoch]
To prevent overfitting and to train the network better there are several techniques. One of these is Dropout, and among the optimizers the most popular are Adam and RMSProp. The optimizers are variants of the weight update rule that give better performance of the Neural Network. RMSProp is an adaptive learning rate method, a variant of the Adagrad method for updating the weights: it modulates the learning rate of each weight based on its gradient values, equalizing their effect.
Adam is a newer method and the most used: it can be seen as an evolution of RMSProp that also uses momentum. With Dropout we avoid overfitting by reducing the number of active nodes: the values of randomly selected nodes are set to 0, so the hidden layers effectively use fewer nodes, according to the dropout rate set in the Neural Network.
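As a minimal sketch in Python (the learning rate, decay rate, epsilon and dropout rate values are assumptions for illustration, not tied to any specific library), an RMSProp-style weight update and an inverted dropout mask could look like this:

import numpy as np

def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a running average of the squared gradients ...
    cache = decay * cache + (1 - decay) * grad ** 2
    # ... and scale the learning rate of each weight by it.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def dropout(activations, rate=0.5):
    # Randomly set a fraction `rate` of the nodes to 0 (inverted dropout:
    # the surviving nodes are rescaled so the expected value is unchanged).
    mask = (np.random.rand(*activations.shape) > rate) / (1 - rate)
    return activations * mask

# Usage with assumed example values:
w, cache = np.array([2.0, 3.0, 1.0]), np.zeros(3)
grad = np.array([7.0, 7.0, -1.0])
w, cache = rmsprop_update(w, grad, cache)
hidden = dropout(np.array([0.5, 1.2, 0.3, 0.8]))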
[Figure: Dropout]
[Figure: Gradient Descent]
[Figure: search of the minimum of the cost function in 2D]
[Figure: search of the minimum of the cost function in 3D]

The images above show the search for the minimum of the cost function in two and three dimensional space.
The formula for calculating the cost function for a single input and for all the input data is:

C = (ŷ - y)^2 / 2 for a single input
C_total = Σ (ŷ_i - y_i)^2 / 2 summed over all the input data
Example of calculation of the cost function
[Table: actual values, predictions P1, P2, P3 and cost functions C1, C2, C3 for the three coefficient pairs]
In this section we calculate the cost function of a simple linear regression Y = aX + b using the squared error of the predictions. This cost measures the error of the predicted values of the regression compared to the actual values: the difference is squared, so it is always positive, and then divided by 2.
In this example the coefficients with the minimum cost function are a = 1 and b = 1, so the model for the regression is y = x + 1.
In this example we simulate three cases with the coefficients (a, b) equal to (1, 1), (2, 1) and (2, 2). The predictions are in the columns P1, P2 and P3, and their cost functions are calculated in the columns C1, C2 and C3. Summing each column gives the total cost function, and we take the coefficients with the minimum total cost. When we use Backpropagation, the weights are updated with a portion of the error controlled by the Learning Rate, a very important parameter in Machine Learning. We select a very small Learning Rate (at most 0.05) and update the weights.
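As a minimal sketch in Python (the X and Y values below are an assumed illustrative dataset generated from y = x + 1, not the values in the table), the three candidate coefficient pairs can be compared like this:

# Illustrative data generated from the true model y = x + 1.
X = [1, 2, 3, 4, 5]
Y = [x + 1 for x in X]

def total_cost(a, b):
    # Half squared error summed over all the input data.
    return sum((a * x + b - y) ** 2 / 2 for x, y in zip(X, Y))

for a, b in [(1, 1), (2, 1), (2, 2)]:
    print(f"a={a}, b={b} -> cost={total_cost(a, b)}")
# (1, 1) gives cost 0, so it is the pair with the minimum cost function.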
Here in the schema we show the difference between Batch Gradient Descent and Stochastic Gradient Descent. In SGD we update the weights after calculating the cost function of each single input, while in Batch Gradient Descent we update the weights only after calculating the total cost function. In Mini-Batch Gradient Descent we divide the training data into blocks of the chosen batch size, then we calculate the cost function of one block and update the weights.
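As a minimal sketch in Python of the three variants (the data, learning rate, number of epochs and batch size are assumptions for illustration), the difference is in when the weight update happens:

import random

# Illustrative data generated from the true model y = x + 1 (assumed, for demonstration).
data = [(x, x + 1.0) for x in range(1, 9)]
lr, n_epochs, batch_size = 0.01, 10000, 4

def step(a, b, samples):
    # One gradient descent step on the half squared error, averaged over `samples`.
    grad_a = sum((a * x + b - y) * x for x, y in samples) / len(samples)
    grad_b = sum((a * x + b - y) for x, y in samples) / len(samples)
    return a - lr * grad_a, b - lr * grad_b

# Batch Gradient Descent: one update per epoch, after the whole training set.
a, b = 0.0, 0.0
for epoch in range(n_epochs):
    a, b = step(a, b, data)
print(f"batch      a={a:.2f} b={b:.2f}")

# Stochastic Gradient Descent: one update per single training sample.
a, b = 0.0, 0.0
for epoch in range(n_epochs):
    for sample in data:
        a, b = step(a, b, [sample])
print(f"stochastic a={a:.2f} b={b:.2f}")

# Mini-Batch Gradient Descent: one update per block of batch_size samples.
a, b = 0.0, 0.0
for epoch in range(n_epochs):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        a, b = step(a, b, data[i:i + batch_size])
print(f"mini-batch a={a:.2f} b={b:.2f}")
# All three approach the minimum of the cost function at a = 1, b = 1.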
Computational Graph
[Figure: the Computational Graph]
To calculate the gradient we introduce the Computational Graph. The Computational Graph is a way to represent a process as a sequence of steps, a data flow graph where the operations are the nodes of the chart. Each step corresponds to a simple operation that takes some inputs and produces some outputs, like a function. The image shows a Computational Graph. During the Backward Pass, the local gradient calculated at each step is combined with the upstream gradient flowing back from the final cost function.
The gradient calculated in this way flows back to the hidden layers, where it is used to update the weights, and then the process restarts with the Forward Pass.
Example 1
The Computational Graph is used both for the Forward Pass and for the Backward Pass. Here is a simple example of a Computational Graph that represents a function: we calculate the value of the function in black and the gradients in red, with the rules shown in the image.
The function is f(x, y, z) = (x + y)z with x = 3, y = -4, z = 7, and we introduce the intermediate variable w = x + y.
[Figure: gradient]
Local Gradient
[Figure: local gradient]
In the graph we can see how to calculate the gradients using the local gradients and the Chain Rule. The Chain Rule tells us how to find the derivative of a composite function.
The MULTIPLY gate returns the gradients with the following rule:
- the gradient of x is equal to the upstream gradient multiplied by the value of y in the Forward Pass
- the gradient of y is equal to the upstream gradient multiplied by the value of x in the Forward Pass.
The ADD gate passes the same value of the upstream gradient to both the x and y gradients.
The MAX gate returns the gradients with the following rule:
- the input with the greater value in the Forward Pass between x and y takes a gradient equal to the upstream gradient, and the other takes a gradient equal to zero.
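A minimal sketch in Python of Example 1, applying these gate rules to f(x, y, z) = (x + y)z:

# Forward Pass of f(x, y, z) = (x + y) * z
x, y, z = 3.0, -4.0, 7.0
w = x + y          # ADD gate:      w = -1
f = w * z          # MULTIPLY gate: f = -7

# Backward Pass (Chain Rule), starting from df/df = 1 as the upstream gradient.
df_df = 1.0
# MULTIPLY gate: each input gets the upstream gradient times the other input's forward value.
df_dw = df_df * z    # = 7
df_dz = df_df * w    # = -1
# ADD gate: both inputs receive the upstream gradient unchanged.
df_dx = df_dw * 1.0  # = 7
df_dy = df_dw * 1.0  # = 7

print(f, df_dx, df_dy, df_dz)  # -7.0 7.0 7.0 -1.0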
Update the Weights
Now that we have calculated the gradients, in this example we update the weights with the following formula:
w_new = w_old - α * derivative
where w_old is the old weight, α is the Learning Rate and derivative is the partial derivative (gradient) with respect to each input
The old weights are w1_old = 2, w2_old = 3, w3_old = 1 and the learning rate is 0.05:
w1_new = 2 - (0.05 * 7) = 1.65
w2_new = 3 - (0.05 * 7) = 2.65
w3_new = 1 - (0.05 * (-1)) = 1.05
Now we put the new weights into another iteration while we fit our model.
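A minimal sketch in Python of this update rule, applied to the gradients calculated in Example 1:

learning_rate = 0.05
old_weights = [2.0, 3.0, 1.0]   # w1_old, w2_old, w3_old
gradients   = [7.0, 7.0, -1.0]  # partial derivatives from the Backward Pass

# w_new = w_old - learning_rate * gradient
new_weights = [w - learning_rate * g for w, g in zip(old_weights, gradients)]
print(new_weights)  # approximately [1.65, 2.65, 1.05]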
Example 2
Here is another example of a Computational Graph used to calculate the Forward Pass and the Backward Pass of a Neural Network. The formula of this example is the Sigmoid function sigmoid(X), where X is the regression function y = β0 + β1x1 + β2x2 + ... + βnxn.
In practice this represents a logistic regression, as you can see in the Bank Marketing project. Here we have only 2 input nodes, X1 and X2, but the process for calculating the Backpropagation is the same. In the graph, the gradient values are shown in red from right to left; with Backpropagation these are used to update the weights of the Neural Network.
[Figure: gradient example]
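A minimal sketch in Python of this Example 2 graph (the numeric values of the inputs and coefficients are assumptions, since the original figure is not reproduced here):

import math

# Assumed values for illustration; the original figure uses its own numbers.
x1, x2 = 1.0, 2.0
b0, b1, b2 = 0.5, -1.0, 0.75

# Forward Pass: linear combination followed by the Sigmoid.
X = b0 + b1 * x1 + b2 * x2
y_hat = 1.0 / (1.0 + math.exp(-X))

# Backward Pass: the local gradient of the Sigmoid is y_hat * (1 - y_hat),
# multiplied by the upstream gradient (here taken as 1.0).
upstream = 1.0
dX = upstream * y_hat * (1.0 - y_hat)
# Each coefficient's gradient is the gradient of X times its input (MULTIPLY gate).
db0, db1, db2 = dX * 1.0, dX * x1, dX * x2

print(y_hat, db0, db1, db2)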