[0:00] welcome to this basic video about
[0:02] artificial neural networks
[0:05] a neural network consists of input nodes
[0:09] a hidden layer
[0:11] and output nodes
[0:14] for example these input nodes may
[0:16] represent three measurements that we do
[0:19] to determine if someone has prostate
[0:22] cancer or not
[0:24] suppose that the person enters the
[0:27] hospital
[0:28] we check the age
[0:30] and the concentration of the prostate
[0:32] specific antigen from a blood sample
[0:35] and collect a score between 1 and 5 from
[0:38] an MRI scan of the prostate
[0:41] then we plug in the corresponding values
[0:43] in the input nodes
[0:46] and use the network to tell if the
[0:48] person has prostate cancer or not
[0:51] to understand how neural nets work we
[0:54] will here have a look at a super
[0:56] simple example
[0:58] at the end of this video we'll try to
[1:00] understand what happens if you include a
[1:02] hidden layer
[1:04] and look at some simple R code if you would like
[1:06] to reproduce the first example shown in
[1:08] this video
[1:11] we will here see how to train a neural
[1:13] network to predict if someone has
[1:15] prostate cancer based on the PSA level
[1:18] our training data consists of seven
[1:21] patients that we know have prostate
[1:23] cancer
[1:24] and seven individuals that we know are
[1:27] healthy based on a blood sample
[1:30] we have determined the concentration of
[1:32] PSA or the prostate specific antigen in
[1:35] the blood of these individuals
[1:38] note that this data has been simulated
[1:40] for the purpose of this video
[1:43] we'll here use the simplest possible
[1:45] neural network model
[1:48] since we only have one measured variable
[1:50] the PSA concentration the network will
[1:53] only have just one input node
[1:56] note that we here do not have any hidden
[1:58] layer
[1:59] because we like to use the simplest
[2:01] possible network
[2:03] the network has two output nodes because
[2:06] we here like to use the network to
[2:08] predict if someone has prostate cancer
[2:10] or not
[2:12] this is called a bias which is used to
[2:15] modify the activation function
[2:18] there are several different activation
[2:20] functions to select between
[2:23] we will here use the sigmoid function
[2:25] which is the same function that is used
[2:27] in logistic regression
[2:29] this will later allow us to compare the
[2:32] neural network with logistic regression
[2:35] let's make a plot of this data
[2:39] like this
[2:41] where these points represent the cancer
[2:44] patients that are here coded as ones
[2:48] whereas these points represent the
[2:50] healthy individuals that are coded as
[2:53] zeros
[2:54] for example this healthy individual has
[2:57] a PSA level of 2.5
[3:01] whereas this cancer patient has a PSA
[3:03] level of 2.1
[3:07] training a neural network means that we
[3:10] find the optimal values of the weights
[3:14] and the weight associated with the
[3:15] bias
[3:17] so that the activation function in this
[3:19] case
[3:21] generates a sigmoid curve that can be
[3:24] used to predict the outcome in an
[3:25] optimal way
[3:27] note that the logistic activation
[3:29] function generates a value between 0 and
[3:32] 1.
[3:34] and that e is Euler's number
[3:37] if we train this simple network we will
[3:40] obtain the following weights
[3:43] let's use this network to see if this
[3:46] healthy person is correctly predicted as
[3:49] being healthy by the network
[3:52] we plug in its PSA level of 2.0 in the
[3:55] input node
[3:57] to calculate the probability that the
[3:59] person has cancer
[4:02] we use the weights associated with the
[4:04] output node
[4:06] in this equation
[4:08] where w zero is the bias weight
[4:11] and this is the weight for the input signal
[4:14] whereas this is the value of the input
[4:16] node
[4:18] since we only have one weight that is
[4:20] associated with the output node n is
[4:23] here equal to 1.
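Written out, the equation and the activation function referred to here presumably take the following form. This is a reconstruction from the narration, since the on-screen formula is not part of the transcript; the symbols z (weighted sum) and a (output of the node) are labels chosen here.

```latex
z = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad a = \frac{1}{1 + e^{-z}}
```

With a single input node, n = 1 and the sum reduces to z = w_0 + w_1 x_1.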
[4:25] let's plug in the weights in the
[4:27] equation
[4:29] and the value of the input node
[4:32] and do the math
[4:34] we now plug in this value in the
[4:37] logistic activation function
[4:39] and calculate
[4:41] this value is the value for the output
[4:44] node cancer
[4:45] which can in this case be seen as the
[4:47] probability that the person has prostate
[4:50] cancer
[4:52] let's place this value here
[4:55] we'll now calculate the value of this
[4:57] output node
[4:59] where we plug in the corresponding
[5:01] values in the equation
[5:03] and do the math
[5:06] let's place the corresponding value here
[5:08] which can be seen as the probability
[5:10] that the person is healthy
[5:13] we can now use some kind of threshold to
[5:16] determine if we should classify the
[5:17] individual as having prostate cancer or
[5:19] not based on these two values
[5:23] a common cutoff value to use in binary
[5:25] classification is 0.5 because it is the
[5:29] midpoint between 0 and 1.
[5:32] based on this cutoff value the network
[5:34] will predict that the person with the
[5:37] PSA level of 2 is healthy because the output
[5:40] value associated with the healthy class
[5:43] is greater than 0.5
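The hand calculation above can be sketched in a few lines of Python. The bias weight of negative 5.754 is the value quoted later in the video; the input weight is not stated in the transcript, so the value used below (roughly 2.75) is an assumption back-calculated from the quoted outputs. The function name `predict` is made up for this sketch.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

W0 = -5.754   # bias weight quoted later in the video
W1 = 2.7515   # ASSUMPTION: back-calculated from the quoted outputs, not stated

def predict(psa, cutoff=0.5):
    """Return (p_cancer, p_healthy, label) for one PSA value."""
    z = W0 + W1 * psa
    p_cancer = sigmoid(z)       # output node "cancer"
    p_healthy = sigmoid(-z)     # flipped curve, equal to 1 - p_cancer
    label = "cancer" if p_cancer > cutoff else "healthy"
    return p_cancer, p_healthy, label

p_cancer, p_healthy, label = predict(2.0)
print(round(p_cancer, 3), round(p_healthy, 3), label)
```

Under these assumed weights the two outputs for a PSA level of 2.0 should land close to the 0.438 and 0.562 discussed in the video, and the person is labeled healthy.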
[5:46] note that this value is exactly the same
[5:49] value
[5:50] as if we would draw a vertical line from
[5:53] 2 to the logistic curve
[5:56] and then a horizontal line from the
[5:58] curve to the y-axis
[6:00] the height of the activation function is
[6:02] therefore equal to 0.438
[6:06] this value is then simply just 1 minus
[6:10] 0.438
[6:13] this value is equal to the height of the
[6:15] flipped curve
[6:18] because the slope of the curve should be
[6:20] negative if you go this way in the
[6:22] network
[6:24] if you have seen my previous videos
[6:26] about logistic regression you will note
[6:28] that such a model produces the exact same
[6:31] results
[6:33] the estimated parameters of the logistic
[6:35] regression model
[6:37] are identical to the weights in this
[6:40] simple neural network
[6:42] b0 or the intercept corresponds to the
[6:46] bias
[6:47] whereas B1 corresponds to the weight
[6:50] associated with the output node for
[6:52] prostate cancer
[6:55] if we plug in this equation into the
[6:57] activation function
[7:00] we will obtain the exact same function
[7:02] that is used in logistic regression
[7:06] we can therefore conclude that this
[7:07] simple neural network using the logistic
[7:10] activation function
[7:12] is identical to logistic regression
[7:16] let's see how well this neural network
[7:18] can predict the class of our training
[7:21] data
[7:22] this is the data that we used to train
[7:25] the network and these are the values
[7:27] generated by the output node for cancer
[7:30] which can be obtained if you plug in
[7:32] these values in the equation
[7:36] and the weights
[7:37] and then compute the corresponding
[7:39] values by the activation function
[7:42] for example this is the value that we
[7:45] calculated by hand previously based on
[7:47] the PSA level of 2.0
[7:50] which corresponds to the height of the
[7:52] activation function
[7:54] we can now use the output values and the
[7:57] cutoff value of 0.5 to see how well the
[8:00] model predicts the training data
[8:02] since these individuals had values
[8:05] greater than 0.5 they are predicted to
[8:08] have prostate cancer
[8:09] since we know that these individuals
[8:12] actually have prostate cancer we know
[8:14] that the neural network has made the
[8:16] correct predictions for these
[8:18] individuals
[8:19] however the network incorrectly predicts
[8:22] that this cancer patient is healthy
[8:25] these individuals have output values
[8:28] that are less than 0.5
[8:30] which means that they are predicted to
[8:32] be healthy since we know that these
[8:35] individuals are healthy we know that
[8:37] they have been correctly classified by
[8:39] the network
[8:41] this healthy person is incorrectly
[8:43] classified as having prostate cancer
[8:45] because he has a relatively high PSA
[8:48] level
[8:49] in total the network makes 12 correct
[8:52] predictions out of 14 possible
[8:55] which gives an accuracy of about 86
[8:57] percent
[8:59] however to get a fair estimate of how
[9:02] well the network would perform on new
[9:04] data
[9:06] we should use a test data set or
[9:08] cross-validation as we have done in all
[9:11] other machine learning methods that we
[9:13] have discussed so far
[9:15] watch the video about validation to see
[9:18] how to evaluate the classifier by the
[9:20] holdout method or by cross-validation
[9:23] once we have established a neural
[9:25] network that has been trained on some
[9:27] data we can use it for prediction
[9:32] suppose that we like to test if this
[9:34] person has prostate cancer or not
[9:37] from a blood sample
[9:39] we measure the PSA concentration to 1.75
[9:43] nanograms per ml
[9:45] we plug in the value in the network
[9:48] we can then compute the values of the
[9:50] output nodes
[9:52] we therefore plug in the value of the
[9:54] input node here
[9:57] the weight of the bias
[10:00] and this weight here
[10:03] and do the math
[10:05] then we plug in this value in the
[10:07] activation function
[10:09] and do the math again
[10:11] 1 minus 0.281 is 0.719
[10:17] since this value is larger than the
[10:19] cutoff value of 0.5 the network predicts
[10:23] that the person is healthy
[10:25] we'll now try to understand where the
[10:28] values for the weights come from
[10:31] in this example these are the optimal
[10:34] values for the weights
[10:36] whereas these are the optimal values for
[10:39] the bias weights
[10:42] the weights are optimized by using some
[10:45] sort of cost function
[10:46] usually the maximum likelihood method
[10:49] for binary classification that we have
[10:51] covered in the videos about logistic
[10:53] regression
[10:54] the following weights were optimized by
[10:57] the negative log likelihood function
[11:01] we can also use the method of ordinary
[11:03] least squares I will here explain the concept
[11:06] of ordinary least squares because it is a bit
[11:09] simpler to understand
[11:12] in the videos about logistic regression
[11:14] I show how to calculate the log
[11:16] likelihood
[11:17] and in another video I show the
[11:20] difference between ordinary least squares and
[11:22] the maximum likelihood method based on
[11:24] the simple linear model
[11:27] when we use the method of least squares
[11:29] we try to minimize this function that
[11:32] computes the sum of the squared errors
[11:35] these are the Y values of the
[11:37] observations which are in this example
[11:40] equal to zero if it is a healthy person
[11:45] and one if it is a patient with prostate
[11:48] cancer
[11:49] y hat is the value of the activation
[11:52] function that can go between 0 and 1.
[11:55] this difference is called a residual or
[11:59] an error which can be seen as the
[12:01] distance between observations and the
[12:04] curve
[12:05] the y value of this data point is zero
[12:09] and the value of the curve when the PSA
[12:11] level is equal to 2.5 is 0.75
[12:16] which can be calculated like this based
[12:19] on the weights we previously estimated
[12:22] this results in a residual of negative
[12:24] 0.75
[12:27] we then Square this residual
[12:30] we then do the same calculations for the
[12:33] next data point and so forth
[12:35] then we sum all the squared residuals
[12:38] which here results in a value of 1.78
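The residual calculation described above can be sketched as follows. The slope weight is again an assumption (it is not stated in the transcript), and only the three data points that the narration quotes explicitly are used, so this will not reproduce the full sum of 1.78 over all 14 patients. The function name `sse` is made up for this sketch.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

W0 = -5.754   # bias weight from the video
W1 = 2.7515   # ASSUMPTION: back-calculated, not stated in the video

def sse(w0, w1, xs, ys):
    """Sum of squared residuals (y - y_hat) over all observations."""
    return sum((y - sigmoid(w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Worked example from the narration: healthy person (y = 0) with PSA 2.5
y_hat = sigmoid(W0 + W1 * 2.5)
residual = 0 - y_hat
print(y_hat, residual, residual ** 2)

# Only three of the 14 training points are quoted in the video:
# healthy at 2.5, cancer at 2.1, healthy at 2.0
xs = [2.5, 2.1, 2.0]
ys = [0, 1, 0]
print(sse(W0, W1, xs, ys))
```

The first printed line should show a curve height near 0.75 and a residual near negative 0.75, matching the numbers in the video.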
[12:43] suppose that we now change the value of
[12:45] the bias from negative 5.754
[12:50] to negative 8.
[12:53] that will move the curve a bit to the
[12:56] right
[12:57] which will increase the sum of squared
[12:59] errors because the data points are now
[13:02] much farther away from the curve
[13:05] let's place a point here which
[13:07] represents that the sum of squared
[13:09] residuals or errors is 2.88 and the bias
[13:13] is set to negative 8.
[13:15] when the bias is set to negative 5.754
[13:20] the sum of the squared errors was equal
[13:22] to 1.78
[13:25] if we set the bias to negative 3
[13:28] that will move the curve a bit to the left
[13:32] which results in a sum of the
[13:34] squared errors equal to 3.62
[13:39] if we would try many different values of
[13:41] the bias weight
[13:42] we would be able to generate a curve that
[13:45] shows how the sum of the squared errors
[13:48] changes when we change the value of the
[13:51] weight associated with the bias
[13:54] the method of least squares finds the
[13:57] value of the weight that results in the
[13:59] lowest possible sum of squared errors
[14:03] this explains why the weight of the bias
[14:05] is equal to about negative 5.8 in this
[14:08] example because that value results in a
[14:12] curve that is as close as possible to
[14:14] the data points which means
[14:16] that the network predicts prostate
[14:18] cancer in the best way
[14:22] to find the value of the weight that
[14:24] results in the lowest sum of squared
[14:25] errors
[14:27] we need to start with an initial guess
[14:30] suppose that we initially guess the
[14:32] value of the weight to negative three
[14:35] then we use an algorithm such as
[14:37] gradient descent where we move step by step
[14:40] along the direction of the steepest
[14:42] descent
[14:43] like this
[14:46] if we instead initially guess the value of
[14:48] the bias to negative 9.
[14:52] the gradient descent method will instead
[14:54] increase the value of the weight until
[14:56] it finds the value that minimizes the
[14:58] sum of the squared errors
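A minimal sketch of this idea, assuming plain gradient descent on the bias weight only, with the slope weight held fixed. The six data points below are hypothetical toy values made up for this sketch (the video's 14 simulated patients are not listed in the transcript), and the learning rate and step count are arbitrary choices.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL toy data, made up for this sketch (not the video's data)
XS = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]   # PSA-like values
YS = [0, 0, 0, 1, 1, 1]                # 0 = healthy, 1 = cancer
W1 = 2.7515                            # ASSUMPTION: slope weight held fixed

def sse_gradient(w0):
    """Derivative of the sum of squared errors with respect to the bias w0."""
    grad = 0.0
    for x, y in zip(XS, YS):
        p = sigmoid(w0 + W1 * x)
        grad += -2.0 * (y - p) * p * (1.0 - p)
    return grad

def gradient_descent(w0, learning_rate=0.5, steps=2000):
    """Repeatedly step against the gradient, starting from the guess w0."""
    for _ in range(steps):
        w0 -= learning_rate * sse_gradient(w0)
    return w0

from_right = gradient_descent(-3.0)   # initial guess above the minimum
from_left = gradient_descent(-9.0)    # initial guess below the minimum
print(from_right, from_left)
```

For this toy data the error curve has a single minimum, so both starting guesses should end at the same bias value: the guess of negative 3 is decreased and the guess of negative 9 is increased.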
[15:01] one problem occurs when we have a more
[15:04] complicated network because the error
[15:06] function for such a network would then
[15:09] have several local minima
[15:12] this means that if we initially guess the
[15:14] value of the bias to negative 9
[15:17] the method will move down to the local
[15:19] minimum and report negative 8 as the
[15:22] best value
[15:24] which will result in a bad fit to the
[15:26] data
[15:28] if we initially guess the value to
[15:30] negative 6.5
[15:32] or negative 3.5
[15:34] then we will reach the global minimum
[15:39] which results in the best fit
[15:42] it is therefore important that we try
[15:44] many initial guesses of the weights
[15:46] during the learning process to find a
[15:48] network that can predict the data in the
[15:51] best way
[15:52] most software tools generate random
[15:54] initial guesses of the weights
[15:57] which means that you might get different
[15:58] results every time you train your
[16:01] network
[16:02] we can make a similar error curve if we
[16:05] change this weight
[16:07] when we change this weight the curve
[16:09] will change its slope
[16:12] when we estimate two weights
[16:13] simultaneously the error function will
[16:16] correspond to a surface in a
[16:19] three-dimensional plot like this
[16:21] the combined values of the two weights
[16:24] that result in the minimum value of this
[16:26] function will result in the best fit to
[16:28] the data
[16:30] we'll now try to understand the purpose
[16:32] of using a hidden layer in neural
[16:34] networks
[16:35] to show the beauty of the hidden layers
[16:39] suppose that we have the following data
[16:41] where one has measured a certain protein
[16:44] in blood samples from 10 cancer patients
[16:47] and five healthy individuals
[16:50] people who are healthy have an
[16:52] intermediate level of this protein in
[16:55] their blood
[16:56] whereas people with prostate cancer either
[16:58] have a low level
[17:00] or a high level
[17:02] if we would try to fit a logistic
[17:04] regression model or train a neural network
[17:07] with a logistic activation function
[17:09] without a hidden layer
[17:12] we may fail to make a good prediction
[17:14] of the training data because an s-shaped
[17:17] curve is simply not useful for this type
[17:19] of data
[17:21] these cancer patients will incorrectly
[17:23] be predicted to be healthy by the model
[17:27] and if you flip the curve these patients
[17:29] will instead incorrectly be predicted to
[17:31] be healthy
[17:33] the curve might also look like this
[17:37] where all the healthy individuals will
[17:39] now be incorrectly predicted to have
[17:41] prostate cancer
[17:43] what we want is a curve that looks like
[17:45] this
[17:47] this is where the hidden layer comes in
[17:49] because a neural network with one or
[17:52] several hidden layers can generate cool
[17:54] curves like this
[17:56] we'll here extend our previous network
[17:59] with one hidden layer
[18:02] if you train this network
[18:04] on the following data
[18:07] we will obtain the following values of the
[18:09] weights
[18:10] note that if you try to train the neural
[18:13] network on the same data you may end up
[18:16] with different values compared to the
[18:17] ones shown here because you will probably
[18:20] end up in a different local minimum to
[18:23] understand the calculations behind this
[18:25] neural network let's see how it predicts
[18:27] a person with a value of 3 for protein X
[18:31] the value of this node
[18:34] can be calculated like this
[18:36] which results in a value that is
[18:38] approximately equal to one
[18:42] let's set the value of this node to 1.
[18:45] the corresponding value of this node is
[18:47] 0.869
[18:51] the value of this output node is
[18:53] calculated based on
[18:55] the bias
[18:57] and the sum of the weight multiplied by
[19:00] the first node in the hidden layer
[19:03] and the weight multiplied by the second
[19:05] node in the hidden layer
[19:07] which results in a value that is
[19:09] approximately equal to one
[19:13] similarly the value of the second output
[19:16] node is approximately equal to zero
[19:21] if the value of the input node is 3.
[19:24] the network will predict that the person
[19:26] has prostate cancer
[19:29] let's focus on how this output node is
[19:32] calculated
[19:33] this is how this node was calculated
[19:37] and this is how the values of the nodes
[19:39] in the hidden layer were calculated
[19:42] let's substitute X1 and x2 in the
[19:45] equation
[19:46] like this
[19:49] and we plug in this equation in
[19:51] the activation function here
[19:54] so that we have the following function
[19:56] that can compute the value of the output
[19:58] node based on any input value
[20:02] let's move the function up here
[20:04] if you now calculate the value of the
[20:06] function based on a range of different
[20:08] values of X between 0 and 4. we will be
[20:12] able to generate a curve with this shape
[20:14] which predicts with 100 percent accuracy
[20:18] because all individuals with prostate
[20:20] cancer will get a value of approximately
[20:22] one and will therefore be correctly
[20:24] predicted to have prostate cancer
[20:27] whereas all the healthy individuals will
[20:29] get a predicted value of approximately
[20:31] zero which will result in that they are
[20:34] correctly predicted to be healthy
[20:37] similarly the corresponding function of
[20:40] this output node
[20:41] will generate the opposite curve
[20:44] this is the magical stuff behind neural
[20:46] networks they can help us to generate
[20:49] non-linear functions that can fit to
[20:51] almost any data that seems impossible to
[20:53] do with standard statistical methods
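To make the composed function concrete, here is a minimal sketch of a network with one input, two hidden nodes, and one output node for cancer. The trained weights shown in the video are not part of the transcript, so every weight below is a hypothetical value picked by hand to give a curve that is high at low and high protein levels and low in between. The function name `cancer_output` is also made up.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL weights chosen by hand for illustration; the video's trained
# weights are not given in the transcript.
def cancer_output(x):
    """Output of the cancer node for protein level x."""
    h1 = sigmoid(-15.0 + 10.0 * x)   # hidden node 1 switches on above about 1.5
    h2 = sigmoid(-25.0 + 10.0 * x)   # hidden node 2 switches on above about 2.5
    # bias plus weighted sum of the two hidden nodes, then the activation
    return sigmoid(5.0 - 10.0 * h1 + 10.0 * h2)

for x in [0.5, 1.0, 2.0, 3.0, 3.5]:
    print(x, round(cancer_output(x), 3))
```

With these made-up weights the output is close to one for low and high values of x and close to zero for intermediate values, which is the shape a single s-shaped curve cannot produce.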
[20:57] we'll now discuss a few similarities and
[20:59] differences between neural networks and
[21:02] standard statistical models such as the
[21:05] logistic regression model
[21:07] the input in neural nets is usually
[21:09] called a predictor or explanatory
[21:12] variable in logistic regression
[21:14] the output in neural nets is usually
[21:17] called the response variable in logistic
[21:19] regression
[21:21] the weights in neural networks are
[21:23] usually called coefficients or
[21:25] parameters in logistic regression
[21:28] the intercept in regression corresponds
[21:31] to the bias
[21:33] we usually say that we train the neural
[21:36] network whereas the corresponding thing
[21:38] in regression is that we fit the
[21:40] model to the data or that we estimate
[21:43] the parameters in the model
[21:46] the parameters in regression are
[21:48] estimated by minimizing some error
[21:50] function such as the sum of squares in
[21:53] regression
[21:54] whereas the corresponding thing is
[21:57] called backpropagation in neural nets
[22:00] because we randomly assign the weights
[22:02] with some numbers
[22:04] then we see how well the network
[22:06] predicts the training data
[22:09] then we go back to update the values
[22:12] so that we improve the prediction
[22:15] this is iterated until we reach some
[22:17] convergence where the network can no
[22:19] longer improve the predictions based on
[22:21] the training data
[22:24] the parameters in regression usually
[22:27] have a meaning that we can interpret
[22:29] such as the odds ratio in logistic
[22:31] regression
[22:32] whereas the values of the weights in the
[22:34] neural network usually have no meaning
[22:36] other than that they are used to make a good
[22:39] prediction
[22:40] in regression we usually compute p
[22:42] values that are associated with the
[22:44] parameters which requires that we fulfill
[22:47] a set of assumptions
[22:50] since we use the estimated parameters to
[22:52] interpret the regression model we must
[22:55] find the global minimum of the error
[22:57] function
[22:58] whereas this is not crucial in neural
[23:01] networks because the network might
[23:03] predict the outcome just fine even
[23:05] though it has stopped on a local minimum
[23:09] finally we'll have a look at some basic
[23:11] R code if you like to reproduce the
[23:13] numbers in this video
[23:15] R is a free software tool that you can
[23:18] download from the following website
[23:20] we first plug in the data into an R
[23:22] script
[23:24] we will here use the neuralnet package which
[23:27] you first need to install if you have
[23:29] not already done that
[23:31] then we load this package
[23:34] and train the neural network
[23:36] where we try to predict cancer versus
[23:38] healthy based on the PSA level which is
[23:42] the only input in this case
[23:44] we plug in the training data here
[23:47] since we in our example like to train a
[23:49] neural network without a hidden layer we
[23:52] set this argument to zero
[23:55] and use the logistic activation function
[23:58] the error function is here set to cross
[24:01] entropy which in this example is the
[24:03] same thing as using the negative log
[24:05] likelihood function to estimate the
[24:07] weights
[24:10] we can print the output
[24:12] and plot the network
[24:14] after
[24:16] 754 iterations to optimize the values of
[24:19] the weights in order to minimize the
[24:21] negative log likelihood function
[24:23] we have an error of 10.54 which in this
[24:27] case corresponds to the negative log
[24:29] likelihood value
[24:30] the smaller this value is the better the
[24:33] model fits the data
[24:36] for example if you like to use the
[24:38] network for prediction we can use the
[24:41] function predict
[24:42] in this example the network will predict
[24:45] that the person with a PSA level of 2 is
[24:48] healthy
[24:50] if you like to compare with the logistic
[24:52] regression model you can run the
[24:54] following line
[24:56] if you like to change the error function
[24:58] to the sum of squared errors you can set
[25:01] this argument to SSE
[25:03] the reason why we here use a lower
[25:06] threshold for convergence compared to
[25:08] the default value of 0.01 is simply to
[25:11] get similar values as in logistic
[25:13] regression but this will be more
[25:15] computationally expensive
[25:18] I also recommend that you run a number
[25:20] of repetitions
[25:22] in this example we will generate 10
[25:24] networks that have been based on
[25:26] different initial values of the weights
[25:30] then you can study the output of these
[25:32] 10 networks and select the one with the
[25:34] lowest error
[25:37] in future videos we'll have a look at
[25:40] networks where the output can result in
[25:42] several categories compared to the
[25:44] simple binary case we used in this video
[25:48] we'll also have a look at networks where
[25:50] the output is a continuous variable
[25:53] and see how the number of hidden nodes
[25:55] affects the output we will also discuss
[25:57] the problem of overfitting
[26:01] in another video I will cover non-linear
[26:03] regression which is highly related to
[26:06] neural networks
[26:07] this was the end of this basic video
[26:09] about artificial neural networks thanks
[26:12] for watching