 # Inner Workings of ML

This was originally my first Math Internal Assessment; however, I have since changed the topic. I didn’t want the work to go to waste, so I decided to post it here. Though the mathematics doesn’t hold up for more complex ML models, I feel that this still provides a good look into the inner workings of machine learning.

## Introduction:

In this paper, the inner workings of one of the most revolutionary and pivotal technologies are explored: machine learning. Machine learning is crucial to our everyday life; it forms the foundation of much modern software. Every time you google something, you are feeding a machine learning algorithm; every time you order something online, you are feeding a machine learning algorithm; and every time you visit a social media site, you are feeding a machine learning algorithm. The key output is prediction, and that is what we will be exploring. In this paper we will not only outline how a prediction neural network (a technology that is a subset of the vast umbrella of machine learning) operates, but also how changes to its internal wiring, i.e. changing the activation function, affect its predictions. I was incredibly interested in this topic because of how I can use it to optimize certain tasks, for example, predicting how much food is needed on a given day for x students with a certain meal.

However, to understand this topic, you must first understand the basics of what machine learning and neural networks actually are.

## What is Machine Learning:

Firstly, what is it? Essentially, just as human labor has in large part been automated and replaced by ‘mechanical muscle’ (for example in labor-intensive farming), neural networks are ‘mechanical minds.’ They start out like a blank slate, not knowing how to do anything except change themselves, i.e. they can modify their internal wiring. To ‘train’ (in other words, ‘teach’) these programs/software/machines/models, we subject them to tests, known as training sets.

A good way to illustrate this is with an example. In a video by CGP Grey, he outlines the basics of neural networks with the example of a program that can look at any picture and say if it’s a bee, a three, or neither. To train this model, we get hundreds of thousands of pictures of bees and threes and break them up into training sets (which are kind of like tests). The model (program) then generates many different iterations of itself and does the test; the best model gets to survive, and the rest are discarded. This process is repeated over and over again until you are able to make a model (machine learning model) that can almost always differentiate between a bee and a three. At the moment, it is unknown how it does this because its internal wiring is too complex for both the machine itself and the programmer who made it to understand, but effectively the machine has learned how to tell a bee from a three.

## Mathematical Foundations (definition of terminology and overarching view of operation):

Now where does Mathematics fall into this? Well, to understand that, we have to go a bit deeper into how machines learn. A neural network is essentially a function: you give it x number of inputs, and it gives you y number of outputs. These are called the input layer and the output layer. In between these two layers, we have what are called hidden layers. These perform all the calculations; the connections between neurons (also known as nodes) get more and more complex as the network learns. The following images illustrate a very basic neural network followed by how they get more and more complex:

As can be seen in the above diagrams, there are connections between each node. These are called weights, and they act similarly to a vector in the sense that a connection can have different magnitudes (i.e. how much it influences the next node). Each node has its own input and output. The input consists of the outputs of the previous nodes and their respective weights. These values are put together (this will be elaborated mathematically in a later section) and passed through a function known as an activation function. The output of the node is then read off this function, with the input on the x-axis and the output as the corresponding value on the y-axis.

There are many different types of activation functions; however, the purpose of this Internal Assessment is to specifically evaluate two of them: the Sigmoid function (σ) and the Hyperbolic Tangent function (tanh). Activation functions are just functions (like sin x) which help the machine learn by determining how a certain input affects the output. Each of these influences the neural network differently; specifically, we will be looking at how long it takes for the outputs to converge. Convergence means that the results of the cost function start to form a straight line and their values are close to 0 when plotted out, i.e. the gradient of the results approaches 0 AND the points are close to the x-axis (see graph on next page). This effectively means that the error (how far the model’s answers are from the actual answers) of the model is being minimized (approaching zero).

The cost function (sometimes called a loss function or error function; the terms are functionally similar) is a measure of how large the errors are across a training set. It can be evaluated in various ways, a key one being the mean squared difference (however, the details are not as important for this paper).
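As a small illustration of the mean squared difference mentioned above, here is a minimal sketch. The function name `mean_squared_error` and the sample predictions are assumptions made up for this example; they are not taken from the network built later.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)


# Hypothetical model outputs compared with the answers they should have matched.
targets = [1, 0, 1]
good_guess = mean_squared_error([0.9, 0.1, 0.8], targets)
bad_guess = mean_squared_error([0.2, 0.7, 0.4], targets)

print("cost of the good guess:", good_guess)
print("cost of the bad guess:", bad_guess)
```

The closer the predictions are to the targets, the closer the cost is to zero, which is the behaviour we look for when we talk about convergence.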

In the above graph, we can see the gradual reduction in the loss over time. In this diagram, η represents the learning rate of the neural network (not very important for us), the epochs are simply the number of iterations, and the cost is just the value of the cost function (a representation of the errors encountered). Here it can be seen that the yellow line is converging over many iterations, and so is the blue line, as they’ve both more or less stabilized or are in the process of stabilizing. As mentioned previously, this is the value of the cost function leveling off.

## Key Components:

As you can see there are four main components to a neural network. First are the different layers; each layer contains a set of nodes, which are the fundamental components of a neural network. Each node does a calculation on the input data and passes the result on to the next node. The next most important thing is a synapse. Synapses are just like sets that hold all the weights in the network (for that layer). This brings us to weights: there is a unique weight for every connection between nodes. This means that for the above diagram, there are 15 connections between Layer 0 (the input layer) and Layer 1 (the hidden layer) and 10 connections between Layer 1 and Layer 2 (the output layer). Therefore, there are 25 weights in this network. Weights act sort of like the gradient of a function: they may not result in the exact value, but a constant when added to it should get close.
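The weight count above can be reproduced with a few lines of arithmetic. This is only a sketch: the layer sizes 3, 5 and 2 are an assumption, chosen because they are the sizes that give 15 and 10 connections in a fully connected network, and `count_weights` is a hypothetical helper name.

```python
def count_weights(layer_sizes):
    """Number of connections in a fully connected network.

    Every node in one layer connects to every node in the next layer,
    so each pair of neighbouring layers contributes size_a * size_b weights.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


layer_sizes = [3, 5, 2]  # assumed sizes of Layer 0, Layer 1 and Layer 2
total_weights = count_weights(layer_sizes)
print(total_weights)
```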

Each weight is noted with a superscript (which does not represent a power) representing what layer that weight belongs to, and a subscript representing what nodes that weight is connecting (i.e. node j to node k). Therefore, $w_{ad}^{(1)}$ represents a weight in Layer 1, connecting node a to node d.

Next, there are biases. Biases are a constant for each layer: unlike a weight, which is unique for each connection, there is one bias for each layer. This acts sort of like a y-intercept, not modifying the underlying function too much, but rather adjusting its position.

Lastly, there are activation functions, which aren’t shown in the above diagram. An activation function is just a function which takes an input (x) and returns an output (y), and it is essentially how a neural network modifies values. As stated previously, we will only be looking at two activation functions: sigmoid and tanh.

When we put this all together, the input for a node’s activation function can be represented by:

$$z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b^{(L)}$$

Where:

$z_j^{(L)}$ is the input for a node j (at layer L)

$w_{jk}^{(L)}$ is the weight connecting node j (layer L) to the previous node k (layer L-1)

$a_k^{(L-1)}$ is the output of node k (at layer L-1)

$b^{(L)}$ is the bias for layer L

And this as the output for a node:

$$a_j^{(L)} = \sigma\left(z_j^{(L)}\right)$$

Where:

$a_j^{(L)}$ is the output of node j (layer L)

$\sigma$ is the activation function
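To make the two formulas concrete, here is a minimal sketch of a single node: the weighted sum of the previous layer’s outputs plus the bias gives the node’s input, and the activation function turns that into the node’s output. The helper name `node_output` and all of the numbers are assumptions chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def node_output(weights, previous_outputs, bias, activation=sigmoid):
    """Return (z, a): the input to the activation function and the node's output."""
    z = sum(w * a for w, a in zip(weights, previous_outputs)) + bias
    return z, activation(z)


# Hypothetical values: three nodes in the previous layer feed one node.
weights = [0.5, -0.25, 0.1]
previous_outputs = [1.0, 2.0, 3.0]
bias = 0.2

z, a = node_output(weights, previous_outputs, bias)
print("z =", z)
print("a =", a)
```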

The last parameter is the learning rate. This will not be discussed in depth as it is not important for the overall investigation. It is essentially a parameter for how big a step the gradient descent algorithm takes (smaller steps being more accurate but slower, bigger steps being faster but potentially inaccurate).

The way the cost is reduced is through an algorithm called gradient descent. Gradient descent is an optimization algorithm which works on reducing the value of a given cost/loss function. It reduces this cost by adjusting each weight by a small ‘step’ after every epoch (every iteration) so that the network becomes more accurate and the error rate falls. It works by moving the weights towards a minimum of the cost function (i.e. where the gradient of the function is equal to zero). This is more easily illustrated with the following graphs:
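The same idea can also be seen numerically. The sketch below runs gradient descent on a made-up one-dimensional cost, $(w - 3)^2$, whose minimum is at $w = 3$; the cost, the starting point and the two learning rates are all assumptions chosen to show a step size that converges and one that overshoots.

```python
def gradient_descent(gradient, start, learning_rate, epochs):
    """Repeatedly step against the gradient and record every value of w."""
    w = start
    history = [w]
    for _ in range(epochs):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return history


def cost_gradient(w):
    # Gradient of the hypothetical cost (w - 3)^2.
    return 2 * (w - 3)


small_steps = gradient_descent(cost_gradient, start=0.0, learning_rate=0.1, epochs=50)
large_steps = gradient_descent(cost_gradient, start=0.0, learning_rate=1.1, epochs=50)

print("learning rate 0.1 ends at", small_steps[-1])
print("learning rate 1.1 ends at", large_steps[-1])
```

With the small learning rate, w creeps towards the minimum; with the large one, every step jumps past the minimum by more than the previous step did, so w moves further and further away.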

## Backpropagation:

Backpropagation is almost exactly as it sounds: you propagate backward (i.e. you retrace your steps). In everyday life you ‘backpropagate’ when you reflect on past actions and how they led to your current situation. In a similar fashion neural networks ‘reflect’ on how their ‘past actions’ influenced their output; in this case past actions would signify the manipulation of the inputs. By doing this type of reflection, neural networks “learn” which actions helped them and which didn’t. This allows them to optimize themselves and reduce their error.

Backpropagation therefore finds the rate of change of the cost function with respect to a component. Backpropagation is done in reference to two components of the network: the weight, and the bias. Out of the two, the weight is the more complex one to derive, so we will start with that.

To derive the backpropagation algorithm, we need to use partial differentiation in conjunction with the chain and power rules.

I.e., we need to find:

$$\frac{\partial C}{\partial w_{jk}^{(L)}}$$

Where:

C is the cost function

$w_{jk}^{(L)}$ is the weight in layer L connecting node j to node k.

Therefore, this statement means:

The effect on the Cost Function C with respect to small changes in the weight in Layer L of the neural network traversing from node j to node k.

Given that there are more than two variables to consider, we need to break the problem up to find the effect of any weight on the cost function.

The following diagram shows all the parameters that affect the output of a node:

The Red Node is on layer L-1, and is node j in that layer. The Green Node is in Layer L (the layer we are evaluating) and is node k in that layer. The function $\sigma$ is the activation function which has the parameter $z_k^{(L)}$ passed through it, where $z_k^{(L)}$ is defined as:

$$z_k^{(L)} = w_{jk}^{(L)} a_j^{(L-1)} + b^{(L)} \qquad (1)$$

Think of this statement sort of like a line (y = mx+c where m is the weight, x is the output of node j and c is the bias). The weight directly influences how much any given input can change the output. If the weight is a small number, it implies the input is of less significance: since the weight is like the gradient of the line, a low weight means a low gradient, so the input changes the output less (and if the input is 0, the weight won’t change the output at all).

Similarly, the bias is like the y-intercept (c) and therefore only makes broad changes to the network, such as increasing all outputs of that layer by a certain value (i.e. it has a bias towards a certain number).

$a_j^{(L-1)}$ is our x value and is the only input. The inputs are in the end what dictate the output, thus our goal is to make a generalization such that any given input value aligns with what we want as an output (it’s essentially creating an algorithm for us to use).

This makes $z_k^{(L)}$ our output (y value); it is just the number that goes into the activation function. The activation function is used to help the network recognize these patterns and make these generalizations.

Returning to the equation, this $z_k^{(L)}$ is then passed through the activation function (for this example, the sigmoid function is shown; however, it can be any non-linear function) to get $a_k^{(L)}$ (the output for node k):

$$a_k^{(L)} = \sigma\left(z_k^{(L)}\right)$$

This process is repeated until we reach the output layer, i.e. the nodes in the output layer. Therefore, we can see that a change in w creates a change in z, which creates a change in a, which in turn changes the cost function. This can simply be outlined as follows:

$$w \rightarrow z \rightarrow a \rightarrow C$$

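Therefore, to find

$$\frac{\partial C}{\partial w}$$

we can rewrite this statement as:

$$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a} \qquad (2)$$

Which is the:

Effect of a change of $w$ on $z$ multiplied by the effect of a change of $z$ on $a$ multiplied with the effect of a change of $a$ on $C$.

(Note that the subscripts have been removed for now to make the equations less cluttered and more readable).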

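We now evaluate all of these partial derivatives in (2) individually:

Since (1):

$$z = w a^{(L-1)} + b$$

Therefore:

$$\frac{\partial z}{\partial w} = a^{(L-1)}$$

Next, we evaluate $\frac{\partial a}{\partial z}$. Since $a = \sigma(z)$:

$$\frac{\partial a}{\partial z} = \sigma'(z)$$

The last partial derivative depends on what error function is being used. Mean Square Difference (Error) is one of the simpler ones, therefore, we will be using that.

Mean Squared Error (M.S.E.), where $y$ is the actual answer and $n$ is the number of training examples:

$$C = \frac{1}{n}\sum_{i=0}^{n-1} (a_i - y_i)^2$$

Therefore, for a single training example:

$$C = (a - y)^2$$

Thus:

$$\frac{\partial C}{\partial a} = 2(a - y)$$

Therefore:

$$\frac{\partial C}{\partial w} = a^{(L-1)} \cdot \sigma'(z) \cdot 2(a - y)$$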
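The chain-rule product for the weight, $\frac{\partial z}{\partial w} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a}$, can be sanity-checked numerically. The sketch below assumes a single connection with a sigmoid activation and the squared error $(a - y)^2$ for one training example; the function names and the numbers are hypothetical. It compares the chain-rule result with a finite-difference estimate of the slope of the cost.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cost(w, b, a_prev, y):
    z = w * a_prev + b
    a = sigmoid(z)
    return (a - y) ** 2


def cost_gradient_w(w, b, a_prev, y):
    z = w * a_prev + b
    a = sigmoid(z)
    dz_dw = a_prev           # effect of w on z
    da_dz = a * (1 - a)      # effect of z on a (derivative of the sigmoid)
    dC_da = 2 * (a - y)      # effect of a on C (derivative of the squared error)
    return dz_dw * da_dz * dC_da


# Hypothetical values for one connection and one training example.
w, b, a_prev, y = 0.4, -0.1, 0.8, 1.0

analytic = cost_gradient_w(w, b, a_prev, y)
h = 1e-6
numeric = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)

print("chain rule:        ", analytic)
print("finite difference: ", numeric)
```

The two numbers agree to many decimal places, which suggests the three partial derivatives have been multiplied together correctly.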

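To find the effect of that particular weight across the entire training set (all cost functions) we can simply do the following, where $C_i$ is the cost of training example $i$:

$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_{i=0}^{n-1} \frac{\partial C_i}{\partial w}$$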

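We must now also calculate the same thing for the bias. This is a much simpler calculation comparatively, as there is only one bias per layer:

$$\frac{\partial C}{\partial b} = \frac{\partial z}{\partial b} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a}$$

Since the final two terms are identical to the previously calculated ones, we simply need to find $\frac{\partial z}{\partial b}$ in order to backpropagate for our bias. Since $z = w a^{(L-1)} + b$:

$$\frac{\partial z}{\partial b} = 1$$

Therefore, by substituting our previous values in (2):

$$\frac{\partial C}{\partial b} = \sigma'(z) \cdot 2(a - y)$$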
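Because the bias enters $z$ with a coefficient of 1, its gradient is just the part of the chain that the weight and the bias share. A minimal sketch of this, assuming a sigmoid activation and the squared error for one training example (the names and numbers are hypothetical):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradients(w, b, a_prev, y):
    """Return (dC/dw, dC/db) for one connection and one training example."""
    z = w * a_prev + b
    a = sigmoid(z)
    shared = a * (1 - a) * 2 * (a - y)   # da/dz * dC/da, common to both
    dC_dw = a_prev * shared              # dz/dw = a_prev
    dC_db = 1.0 * shared                 # dz/db = 1
    return dC_dw, dC_db


dC_dw, dC_db = gradients(w=0.4, b=-0.1, a_prev=0.8, y=1.0)
print("dC/dw =", dC_dw)
print("dC/db =", dC_db)
```

The only difference between the two results is the factor `a_prev`, which is exactly the difference between the two derivations.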

Now that we have understood the tool of backpropagation, we can move on to the actual study of the sigmoid function vs the hyperbolic tangent function, how the choice affects the convergence of the cost/loss function, and the mathematical reason why.

## Application of Gradient Descent Algorithm:

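To apply the gradient descent algorithm, we can do the following, where $\eta$ is the learning rate:

$$w := w - \eta \frac{\partial C}{\partial w}$$

Where:

$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_{i=0}^{n-1} \frac{\partial C_i}{\partial w}$$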

This takes all calculated changes for that weight and averages them. As stated previously, when training a network there could be hundreds, thousands, or even millions of input sets. Each of these input sets has its own corresponding cost. Thus, there is one cost function for each input.

This may look familiar, as it closely resembles the formula for Stochastic Gradient Descent (SGD), which applies the same kind of step using one example (or a small batch) at a time rather than the whole training set.

We start from i=0 because in Python (as in most programming languages) the index of an array or list starts from 0, not 1 (and subsequently, the final index must be n-1).
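The averaging over the training set, with the index running from 0 to n-1, can be sketched as follows. This is a toy model with a single weight and no bias, a sigmoid activation and the squared error; the data, the function names and the learning rate are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def example_gradient(w, x, y):
    """dC_i/dw for one training example, using the chain rule."""
    a = sigmoid(w * x)
    return x * a * (1 - a) * 2 * (a - y)


def gradient_descent_step(w, inputs, targets, learning_rate):
    n = len(inputs)
    total = 0.0
    for i in range(0, n):            # i runs from 0 to n-1
        total += example_gradient(w, inputs[i], targets[i])
    return w - learning_rate * total / n   # step against the averaged gradient


def mean_cost(w, inputs, targets):
    return sum((sigmoid(w * x) - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)


# Hypothetical training set: negative inputs are LOW (0), positive inputs are HIGH (1).
inputs = [-2.0, -1.0, 1.0, 2.0]
targets = [0.0, 0.0, 1.0, 1.0]

w = 0.0
cost_before = mean_cost(w, inputs, targets)
for _ in range(100):
    w = gradient_descent_step(w, inputs, targets, learning_rate=0.5)
cost_after = mean_cost(w, inputs, targets)

print("cost before training:", cost_before)
print("cost after training: ", cost_after)
```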

## Mathematics of a Neural Network:

Now that we’ve defined the tools we need, let’s build a small neural network and explain the mathematics as we go, then run some tests on the sigmoid activation function vs the tanh activation function.

First, we must make our training set. In the training set, to keep it simple, we will be taking three inputs, these will be stored in a list – that is one training case. We can have multiple of these stored in our training set. Now, we must define our sigmoid.

Our neural network will have this form:

## Sigmoid Activation Function:

The sigmoid activation function was used specifically because its outputs lie in a range between 0 and 1, thereby making it especially useful for representing probabilities, with 0 being 0% and 1 being 100%. This, together with its simplicity, made it a popular activation function.

There are more reasons as to why it was initially adopted, the primary one being that it mimicked a biological neuron, specifically the action potential. In biology, an action potential is the signal a neuron fires once its input crosses a threshold, and the shape of the sigmoid reflects this. The sigmoid produces results that can be considered ‘firing’ over a certain threshold (which is defined by the programmer).

The function is defined as follows (also in diagram):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Now we can run some experiments on various training sets and see the effect on how long it takes to stabilize.

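To derive the formula for the derivative of the sigmoid we simply do the following:

$$\sigma(x) = \left(1 + e^{-x}\right)^{-1}$$

Using the Chain Rule:

$$\sigma'(x) = -\left(1 + e^{-x}\right)^{-2} \cdot \left(-e^{-x}\right) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}$$

Simplifying:

$$\sigma'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\left(1 - \sigma(x)\right)$$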

In our program, we can write it like this:

    import math

    def sigmoid(x, derivative=False):
        # derivative=True returns the slope of the sigmoid at x
        if derivative:
            value = sigmoid(x)*(1-sigmoid(x))
        else:
            value = 1/(1+math.exp(-x))
        return value


If we set derivative to True (i.e. call sigmoid(x, True)), the function returns the derivative of the sigmoid; otherwise it simply returns the normal sigmoid function.
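One way to gain confidence in the derivative formula $\sigma(x)(1 - \sigma(x))$ is to compare it with a numerical estimate of the slope. The sketch below repeats the sigmoid so that it runs on its own; the helper `numerical_derivative` and the test points are assumptions added for this check.

```python
import math


def sigmoid(x, derivative=False):
    if derivative:
        return sigmoid(x) * (1 - sigmoid(x))
    return 1 / (1 + math.exp(-x))


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid(x, True), numerical_derivative(sigmoid, x))
```

The slope is largest at x = 0, where it equals 0.25, and shrinks on both sides.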

## Tanh Activation Function:

Now we can do the same for our hyperbolic tangent function:

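Since the tanh function is equal to:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

We can simplify it in our code by doing the following:

$$\tanh(x) = 2\sigma(2x) - 1$$

Now let us find the derivative:

$$\frac{d}{dx}\tanh(x) = 4\sigma(2x)\left(1 - \sigma(2x)\right)$$

which can be further simplified by replacing $\sigma(2x)\left(1 - \sigma(2x)\right)$ with $\sigma'(2x)$:

$$\frac{d}{dx}\tanh(x) = 4\sigma'(2x)$$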

In code it looks like this:
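A minimal sketch of the tanh function and its derivative, written in terms of the sigmoid and mirroring the `tanh` function in the appendix. The comparison against Python’s built-in `math.tanh` is an addition made here purely as a sanity check.

```python
import math


def sigmoid(x, derivative=False):
    if derivative:
        return sigmoid(x) * (1 - sigmoid(x))
    return 1 / (1 + math.exp(-x))


def tanh(x, derivative=False):
    if derivative:
        # 4 * sigmoid'(2x), i.e. 4 * sigmoid(2x) * (1 - sigmoid(2x))
        return 4 * sigmoid(2 * x, True)
    return 2 * sigmoid(2 * x) - 1


x = 0.5
print(tanh(x), math.tanh(x))
print(tanh(x, True), 1 - math.tanh(x) ** 2)
```

Both lines print two matching numbers, since the derivative of tanh can equally be written as $1 - \tanh^2(x)$.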

Now that we have our two functions, we can start the more intense mathematics. Normally, when we are training a neural network (giving it tests) we do many of them at once. It would be very time consuming and computationally intensive if we did each thing one by one, so instead we use matrices.

Our first matrix will be called Layer0, as it is the 0th layer in the neural network, i.e. our input layer. We will be using 3 training sets as an example for the mathematics (however, there can be any number of them).

We now can define our weights. Initially, these are generated randomly and are improved upon as the network learns. Since each input node has two weights (one for each node in the next layer), we can define a 3×2 matrix to store them. We will call this matrix synapse 0, or S0. Therefore, $w_{ad}^{(0)}$ just means the weight connecting nodes a and d in Layer 0.

To get our matrix for Layer 1, we must first take the dot product of both matrices, and then pass each element through our activation function. Mathematically, it looks like this (as covered previously):

Therefore, we can simply do the following:

Note a, b, c are inputs to the node. These are all different for each input. There can be any number of different inputs; we are only going to deal with 3. I will be omitting the input subscripts for simplicity and to reduce clutter from now on.

Since these are our outputs (with each column being a respective node) we can call them nodes d and e for simplicity.

This is denoted in code as follows:
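A minimal sketch of this step, assuming three made-up training cases, randomly generated weights, no bias, and a sigmoid written with NumPy so that it can be applied to a whole matrix at once. The first entry is also recomputed by hand to show what the dot product is doing.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Three hypothetical training cases, each with inputs a, b, c (one case per row).
Layer0 = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [1, 0, 1]])

np.random.seed(1)                    # same random weights on every run
S0 = 2 * np.random.rand(3, 2) - 1    # 3 input nodes feeding 2 hidden nodes

# Dot product first, then the activation function on every element.
Layer1 = sigmoid(Layer0.dot(S0))     # column 0 is node d, column 1 is node e

# Node d for the first training case, written out by hand.
a, b, c = Layer0[0]
d_by_hand = sigmoid(a * S0[0, 0] + b * S0[1, 0] + c * S0[2, 0])

print(Layer1)
print(d_by_hand)
```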

We can now repeat the same thing for Layer 2:

This gives us our final output:

These are the predictions the network has generated; however, now we must train it.
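Putting both layers together, a forward pass through the whole network could be sketched like this. The function name `forward`, the three training cases and the omission of the biases are assumptions made to keep the example short.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def forward(Layer0, S0, S1):
    """Run the inputs through both synapses and return both layers."""
    Layer1 = sigmoid(Layer0.dot(S0))   # hidden layer, one row per training case
    Layer2 = sigmoid(Layer1.dot(S1))   # output layer, one prediction per case
    return Layer1, Layer2


np.random.seed(1)
Layer0 = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [1, 0, 1]])         # hypothetical training cases
S0 = 2 * np.random.rand(3, 2) - 1      # 3 inputs -> 2 hidden nodes
S1 = 2 * np.random.rand(2, 1) - 1      # 2 hidden nodes -> 1 output

Layer1, Layer2 = forward(Layer0, S0, S1)
print(Layer2)
```

Since the weights are still random at this point, the predictions carry no meaning yet.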

To do this, we need to use our previously derived backpropagation formula:

There are quite a few steps to implement this:

First, we will calculate each component individually, starting with the error for Layer 2:

Where y is the actual results (a matrix of the same dimensions as L2). Now, we can find the delta, which is:

This is simply the element-wise product of the error and Layer 2 (after it has been passed through the derivative of the sigmoid):

Now, to get the change in the weights, we need to find a way to multiply this 3×1 matrix with Layer 1 such that each delta corresponds to the correct node. To do this, we can simply take the Transpose of Layer 1 and multiply it with the delta:

Notice that this is the same dimensions as the synapse 1 matrix (S1) that we defined earlier. We can now perform gradient descent. Gradient descent formula:

We can do this by dividing our previous result by the number of elements in the training set, as the previous result is already a summation of all the values. This we then subtract from our existing values of the synapse:

We can implement all this in code with these two functions:

Note that activation_function(Layer, True) simply means that we are putting a given Layer in, and then saying that we are taking the derivative (derivative=True). This is simpler if you refer to the earlier code.
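A minimal sketch of two such functions for the output layer. The names `backpropagate` and `gradient_descent`, the sample data and the learning rate are assumptions. The sketch also expresses the sigmoid derivative through the layer’s own output, as `layer * (1 - layer)`, which is a deliberate simplification and not necessarily how the appendix code does it.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def backpropagate(previous_layer, layer, error):
    """Delta for a layer and the summed change for the weights feeding it."""
    delta = error * layer * (1 - layer)          # element-wise product
    weight_change = previous_layer.T.dot(delta)  # same shape as the synapse
    return delta, weight_change


def gradient_descent(synapse, weight_change, n, learning_rate):
    """Average the summed change over n training cases and step against it."""
    return synapse - learning_rate * weight_change / n


np.random.seed(1)
Layer1 = np.random.rand(3, 2)          # hypothetical hidden-layer outputs
S1 = 2 * np.random.rand(2, 1) - 1      # weights from Layer 1 to Layer 2
y = np.array([[0], [1], [1]])          # hypothetical actual results

Layer2 = sigmoid(Layer1.dot(S1))
error = 2 * (Layer2 - y)               # derivative of the squared error
delta, weight_change = backpropagate(Layer1, Layer2, error)
S1_new = gradient_descent(S1, weight_change, n=len(y), learning_rate=0.5)

cost_before = np.mean((Layer2 - y) ** 2)
cost_after = np.mean((sigmoid(Layer1.dot(S1_new)) - y) ** 2)
print(cost_before, cost_after)
```

After a single update the mean squared error is already slightly lower than before.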

Now we must do the same for Synapse 0. First we must calculate the error for Layer 1. This is just the delta of Layer 2 multiplied with the corresponding weights. This can be calculated by taking the transpose of synapse 1 and multiplying it with the delta of Layer 2.

We do this to share each delta (the change to be applied) among the weights that contributed to it. We transpose in order to be able to multiply the two matrices, as well as to get the result in the same form as before.

Now we must complete the rest of our calculations for the delta of Layer 1:

Then we multiply the transpose of L0 with delta L1:

This all is written as follows:
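The steps for the hidden layer can be sketched as below. The data and the variable names are assumptions, the biases are left out, and the sigmoid derivative is again written through each layer’s output as `layer * (1 - layer)`.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


np.random.seed(1)
Layer0 = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [1, 0, 1]])         # hypothetical training cases
y = np.array([[0], [1], [1]])          # hypothetical actual results
S0 = 2 * np.random.rand(3, 2) - 1
S1 = 2 * np.random.rand(2, 1) - 1

# Forward pass
Layer1 = sigmoid(Layer0.dot(S0))
Layer2 = sigmoid(Layer1.dot(S1))

# Output layer: error times the slope of the sigmoid (3x1)
Layer2_delta = 2 * (Layer2 - y) * Layer2 * (1 - Layer2)

# Hidden layer: pass the delta back through the transposed synapse (3x2)
Layer1_error = Layer2_delta.dot(S1.T)
Layer1_delta = Layer1_error * Layer1 * (1 - Layer1)

# Summed change for every weight in synapse 0 (3x2, same shape as S0)
S0_change = Layer0.T.dot(Layer1_delta)
print(S0_change)
```

`S0_change` has one entry per weight in synapse 0, so it can be averaged and subtracted from S0 in the same way as for synapse 1.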

If we try to train this on a classification problem (whether something is HIGH, i.e. 1, or LOW, i.e. 0) and plot out the error in the cost function over every epoch/iteration, we can see the following results for the sigmoid function (results from 2 different iterations):

We can clearly see that overall the error with the sigmoid is decreasing quite rapidly over each iteration and that our mathematics seems to be working. It has an overall downward trend, and in both test cases it decreases more rapidly as the number of epochs/iterations increases.

Now let us test the same dataset with the tanh function:

What we see is that the tanh converges much faster and in general the error is incredibly low. Why is this? When we look at both superimposed, we can clearly see the difference in their convergence rate:

## Analysis and Conclusion:

Compared with the same weights when using the sigmoid function, the adjustments are much more drastic when using tanh. It almost seems as though there is a direct correlation between the magnitude of the delta and the rate of convergence. Since the main difference between the tanh and sigmoid when backpropagating is their derivative, we should investigate the difference between the two:

As we can see, the derivative of the tanh function covers a much larger range (it peaks at 1, compared to 0.25 for the sigmoid) and therefore the gradient descent algorithm can apply greater change to each weight. Therefore, the gradient descent algorithm makes much larger adjustments when compared to using the sigmoid function. Thus, the weights can have greater deltas, allowing the loss to converge faster and normalize, i.e. approach a near 0 value and have a gradient close to 0.

However, this isn’t necessarily a good thing because it is possible for the correction to be too large as compared to the sigmoid function, i.e. it has a higher probability of passing the minimum (Fig 7 – 8). This, when coupled with the fact that the gradient rises steeply between certain values (~ -2 to +2), means that the gradient could apply excessively high values at certain points, thus passing the minimum. This also subsequently means that any value outside the aforementioned domain (-2<x<2) will have a very low change applied to it as compared to the sigmoid function.

In our case, tanh was clearly more efficient, with our error converging much faster than sigmoid. However, for something where we don’t want major adjustments (deltas) and only want minor corrections, such as at the end of training a network when the weights are already quite good, tanh can actually be worse as the adjustments could be too much for minor values.

Lastly, there is the vanishing gradient problem, where for synapses earlier in the network, as the gradient approaches zero (i.e. the magnitude of the inputted x value is bigger), there are smaller and smaller changes applied. For a network such as ours, it doesn’t make much of a difference; however, for larger networks with multiple hidden layers, it can become significant. It essentially means that, due to the gradient “vanishing” for earlier weights in the network, later weights will also be affected, as the earlier weights don’t experience as much change and thus the network stops learning effectively. If we look at the tanh graph, we can see that its gradient drops off dramatically, thereby exacerbating this problem (the larger the value, the faster the gradient vanishes).
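The difference between the two derivatives can be checked with a few sample points. This is only a sketch; the sample inputs are chosen arbitrarily, and the tanh derivative is written in its equivalent form $1 - \tanh^2(x)$ using Python’s built-in `math.tanh`.

```python
import math


def sigmoid_derivative(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)


def tanh_derivative(x):
    return 1 - math.tanh(x) ** 2


for x in (0.0, 1.0, 2.0, 4.0):
    print(x, sigmoid_derivative(x), tanh_derivative(x))
```

Near zero the tanh derivative is several times larger than the sigmoid derivative, while far from zero it is the smaller of the two.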
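A rough feel for how quickly the gradient can vanish comes from multiplying the activation derivatives along a chain of layers. The sketch below is a simplification: it assumes every weight along the chain is 1 and that every node receives the same input, and the helper name `gradient_scale` is hypothetical.

```python
import math


def tanh_derivative(x):
    return 1 - math.tanh(x) ** 2


def gradient_scale(node_inputs):
    """Product of the activation derivatives along a chain of nodes.

    With all weights assumed to be 1, this is the factor by which the
    gradient is scaled on its way back to the earliest layer.
    """
    scale = 1.0
    for z in node_inputs:
        scale *= tanh_derivative(z)
    return scale


one_layer = gradient_scale([2.0])
five_layers = gradient_scale([2.0] * 5)

print("gradient factor through 1 layer: ", one_layer)
print("gradient factor through 5 layers:", five_layers)
```

Each extra layer multiplies the gradient by another small number, so the earliest weights receive only a tiny fraction of the correction.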

# Bibliography

Sanderson, Grant. “But what is a Neural Network? | Deep learning, chapter 1.” YouTube, uploaded by 3Blue1Brown, 5 Oct. 2017, URL (www.youtube.com/watch?v=aircAruvnKk)

Sanderson, Grant. “What is backpropagation really doing? | Deep learning, chapter 3.” YouTube, uploaded by 3Blue1Brown, 3 Nov. 2017, URL (www.youtube.com/watch?v=Ilg3gGewQ5U)

Sanderson, Grant. “Backpropagation calculus | Deep learning, chapter 4.” YouTube, uploaded by 3Blue1Brown, 3 Nov. 2017, URL (www.youtube.com/watch?v=tIeHLnjs5U8)

Spencer-Harper, Milo. “How to Build a Simple Neural Network in 9 Lines of Python Code.” Medium, 21 Jul. 2015, medium.com/technology-invention-and-more/how-to-build-a-simple-neural-network-in-9-lines-of-python-code-cc8f23647ca1.

Ng, Andrew. Syllabus for Neural Networks and Deep Learning, Coursera, URL (https://www.coursera.org/learn/neural-networks-deep-learning)

Kinsley, Harrison and Kulieta, Daniel. “Neural Networks from Scratch – P.3 The Dot Product” YouTube, uploaded by sentdex, 24 Apr. 2020, URL (www.youtube.com/watch?v=tMrbN67U9d4&feature=youtu.be)

Spencer-Harper, Milo. “Simple Neural Network.” GitHub repository, URL (github.com/miloharper/simple-neural-network)

Wang, Chi-Feng. “The Vanishing Gradient Problem: The Problem, Its Causes, Its Significance, and Its Solutions.” Towards data science. URL (towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)

# Appendix

## Code

### Neural Network Code (see bibliography for reference code)

    # -*- coding: utf-8 -*-
    """
    Created on Fri Jun 19 03:43:07 2020

    """
    import numpy as np
    import matplotlib.pyplot as plt
    import math

    #Input data
    training_sets = np.array([[0, 0, 1],
                              [1, 1, 1],
                              [1, 0, 1],
                              [0, 1, 1]])

    #Expected outputs, one per training case.
    #ASSUMPTION: the values below are placeholders taken from the cited
    #Spencer-Harper example (the output equals the first input).
    y = np.array([[0],
                  [1],
                  [1],
                  [0]])

    #Activation Functions (scalar; applied to whole layers with np.vectorize)
    def sigmoid(x, derivitive=False):
        if derivitive==True:
            value = sigmoid(x)*(1-sigmoid(x))
        else:
            #Two branches so math.exp never receives a large positive argument
            if x<0:
                value = 1 - 1/(1+math.exp(x))
            else:
                value = 1/(1+math.exp(-x))
        return value
    sigmoid.name = "Sigmoid"

    def tanh(x, derivitive=False):
        if derivitive==True:
            value = 4*sigmoid(2*x)*(1-sigmoid(2*x))
            #or it could be written as 4*sigmoid(2*x, True)
        else:
            value = 2*sigmoid(2*x)-1
        return value
    tanh.name = "Tanh"

    #Error Functions
    #The derivative branches return the NEGATIVE gradient with respect to y_hat,
    #which is why the weight update below adds the change instead of subtracting it.
    def MSE(y_hat, y, derivitive=False):
        if derivitive:
            error = 2*(y - y_hat)
        else:
            error = (y - y_hat)**2
        return error
    MSE.name = "Mean Squared Error"

    def BCE(y_hat, y, derivitive=False):
        #Modified Copy Pasted code
        if derivitive:
            if y == 1:
                value = 1/y_hat
            else:
                value = -1/(1 - y_hat)
        else:
            if y == 1:
                value = -np.log(y_hat)
            else:
                value = -np.log(1 - y_hat)
        return value
    BCE.name = "Binary Cross Entropy"

    def Backpropogation(previous_Layer, Layer, error_derivitive, synapse, Learning_Rate, activation_function):
        #delta: element-wise product of the error and the activation derivative
        delta = error_derivitive*np.vectorize(activation_function)(Layer, True)*Learning_Rate
        #summed change for every weight between previous_Layer and Layer
        weight = previous_Layer.T.dot(delta)
        return delta, weight

    #ASSUMPTION: the original name and signature of this function are not known
    def Gradient_Descent(synapse, weight):
        updated_synapse = synapse + weight#*(1/len(training_sets))
        return updated_synapse

    #Initializing weights and biases
    np.random.seed(1) #Same set/seed of random values each time: allows for fair/consistent testing
    syn0 = 2*np.random.rand(3,2)-1 #3 nodes connecting to 2 nodes (each node needs 2 connections)
    syn1 = 2*np.random.rand(2,1)-1 #2 nodes connecting to 1 node (output) (each node needs 1 connection)

    bias0 = np.random.uniform(-1,1)
    bias1 = np.random.uniform(-1,1)

    #Setting Error Function, Activation Function, and Learning Rate
    Error_func = MSE #MSE is Mean Squared Error, BCE is Binary Cross Entropy
    Activation_func = tanh #sigmoid is sigmoid and tanh is hyperbolic tangent
    Learning_rate = 0.5

    #Training
    error_plotted = []
    for epoch in range(0,100):

        Layer0 = training_sets #Input layer

        #   Values for first hidden Layer
        temp = Layer0.dot(syn0)+bias0
        Layer1 = np.vectorize(Activation_func)(temp)

        #   Values for Second/Output Layer
        temp = Layer1.dot(syn1)+bias1
        if Activation_func.name == "Tanh":
            #Rescale tanh's (-1, 1) output to (0, 1) so it can be compared with y
            Layer2 = (np.vectorize(Activation_func)(temp)+1)/2
        else:
            Layer2 = np.vectorize(Activation_func)(temp)

        #   Calculating how far our prediction was from the Answer
        error = np.vectorize(Error_func)(Layer2, y, True)
        error_plotted.append(np.mean(np.vectorize(Error_func)(Layer2, y)))
        error_name = Error_func.name

        Layer2_delta, summed_weight = Backpropogation(Layer1, Layer2, error, syn1, Learning_rate, Activation_func)
        Layer1_error = Layer2_delta.dot(syn1.T) #uses syn1 before it is updated
        syn1 = Gradient_Descent(syn1, summed_weight)
        bias1 = bias1 + np.mean(Layer2_delta)

        Layer1_delta, summed_weight = Backpropogation(Layer0, Layer1, Layer1_error, syn0, Learning_rate, Activation_func)
        syn0 = Gradient_Descent(syn0, summed_weight)
        bias0 = bias0 + np.mean(Layer1_delta)

        #Output
        if epoch%10 == 0:
            print("Epoch ", epoch+1)
            print("=====================================")
            print(error_name, ":", error_plotted[-1])
            print("-------------------------------------")
            #print("Error individual:\n",error)
            print("=====================================")

    plt.plot(error_plotted)
    plt.ylabel("Error Value")
    plt.xlabel("Epoch")
    plt.title(Activation_func.name)
    plt.show()


### Graphing Code:

    # -*- coding: utf-8 -*-
    """
    Created on Thu Jun 25 18:04:06 2020

    """

    import matplotlib.pyplot as plt
    import math
    import numpy as np

    def sigmoid(x, derivitive=False):
        if derivitive==True:
            value = sigmoid(x)*(1-sigmoid(x))
        else:
            if x<0:
                value = 1 - 1/(1+math.exp(x))
            else:
                value = 1/(1+math.exp(-x))
        return value
    sigmoid.name = "Sigmoid"

    def tanh(x, derivitive=False):
        if derivitive==True:
            value = 4*sigmoid(2*x)*(1-sigmoid(2*x))
            #or it could be written as 4*sigmoid(2*x, True)
        else:
            value = 2*sigmoid(2*x)-1
        return value
    tanh.name = "Tanh"

    activation_func = tanh
    num = 0.74110231
    x = np.linspace(-10,10,100)
    #Plot the derivative of each activation function over the same inputs
    plt.plot(x,np.vectorize(sigmoid)(x,True), color = 'blue',zorder=0, label="Sigmoid")
    plt.plot(x,np.vectorize(tanh)(x,True), color = 'red', label="tanh")
    plt.legend()
    #plt.scatter(num,activation_func(num, True), zorder=1)
    plt.ylabel("Output")
    plt.xlabel("Input")
    plt.show()