[0:00] welcome to this basic video about [0:02] artificial neural networks [0:05] a neural network consists of input nodes [0:09] a hidden layer [0:11] and output nodes [0:14] for example these input nodes may [0:16] represent three measurements that we make [0:19] to determine if someone has prostate [0:22] cancer or not [0:24] suppose that a person enters the [0:27] hospital [0:28] we check the age [0:30] and the concentration of the prostate [0:32] specific antigen from a blood sample [0:35] and collect a score between 1 and 5 from [0:38] an MRI scan of the prostate [0:41] then we plug in the corresponding values [0:43] in the input nodes [0:46] and use the network to tell if the [0:48] person has prostate cancer or not [0:51] to understand how neural nets work we [0:54] will here have a look at a super [0:56] simple example [0:58] at the end of this video we'll try to [1:00] understand what happens if you include a [1:02] hidden layer [1:04] and have a look at some simple R code if you would like [1:06] to reproduce the first example shown in [1:08] this video [1:11] we will here see how to train a neural [1:13] network to predict if someone has [1:15] prostate cancer based on the PSA level [1:18] our training data consists of seven [1:21] patients that we know have prostate [1:23] cancer [1:24] and seven individuals that we know are [1:27] healthy based on a blood sample [1:30] we have determined the concentration of [1:32] PSA or the prostate specific antigen in [1:35] the blood of these individuals [1:38] note that this data has been simulated [1:40] for the purpose of this video [1:43] we'll here use the simplest possible [1:45] neural network model [1:48] since we only have one measured variable [1:50] the PSA concentration the network will [1:53] have just one input node [1:56] note that we here do not have any hidden [1:58] layer [1:59] because we like to use the simplest [2:01] possible network [2:03] the network has two output nodes because [2:06] we here would like
to use the network to [2:08] predict if someone has prostate cancer [2:10] or not [2:12] this is called a bias which is used to [2:15] modify the activation function [2:18] there are several different activation [2:20] functions to choose between [2:23] we will here use the sigmoid function [2:25] which is the same function that is used [2:27] in logistic regression [2:29] this will later allow us to compare the [2:32] neural network with logistic regression [2:35] let's make a plot of this data [2:39] like this [2:41] where these points represent the cancer [2:44] patients that are here coded as ones [2:48] whereas these points represent the [2:50] healthy individuals that are coded as [2:53] zeros [2:54] for example this healthy individual has [2:57] a PSA level of 2.5 [3:01] whereas this cancer patient has a PSA [3:03] level of 2.1 [3:07] training a neural network means that we [3:10] find the optimal values of the weights [3:14] and the weight associated with the [3:15] bias [3:17] so that the activation function in this [3:19] case [3:21] generates a sigmoid curve that can be [3:24] used to predict the outcome in an [3:25] optimal way [3:27] note that the logistic activation [3:29] function generates a value between 0 and [3:32] 1 [3:34] and that e is Euler's number [3:37] if we train this simple network we will [3:40] obtain the following weights [3:43] let's use this network to see if this [3:46] healthy person is correctly predicted as [3:49] being healthy by the network [3:52] we plug in the person's PSA level of 2.0 in the [3:55] input node [3:57] to calculate the probability that the [3:59] person has cancer [4:02] we use the weights associated with [4:04] the output node [4:06] in this equation [4:08] where w zero is the bias weight [4:11] and this is the weight for the input signal [4:14] whereas this is the value of the input [4:16] node [4:18] since we only have one weight that is [4:20] associated with the output node n is [4:23] here equal to 1.
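The equation and the activation function shown on screen are not captured in the transcript. A plausible reconstruction is given below; the symbol names (z for the weighted sum, w_i for the weights, x_i for the input values, sigma for the logistic function) are assumptions chosen to match the spoken description, not the video's own notation.

```latex
z = w_0 + \sum_{i=1}^{n} w_i x_i ,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

With a single input node, n = 1 and the sum reduces to z = w_0 + w_1 x_1, where w_0 is the bias weight and x_1 is the PSA level.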
[4:25] let's plug in the weights in the [4:27] equation [4:29] and the value of the input node [4:32] and do the math [4:34] we now plug in this value in the [4:37] logistic activation function [4:39] and calculate [4:41] this value is the value for the output [4:44] node cancer [4:45] which can in this case be seen as the [4:47] probability that the person has prostate [4:50] cancer [4:52] let's place this value here [4:55] we'll now calculate the value of this [4:57] output node [4:59] where we plug in the corresponding [5:01] values in the equation [5:03] and do the math [5:06] let's place the corresponding value here [5:08] which can be seen as the probability [5:10] that the person is healthy [5:13] we can now use some kind of threshold to [5:16] determine if we should classify the [5:17] individual as having prostate cancer or [5:19] not based on these two values [5:23] a common cutoff value to use in binary [5:25] classification is 0.5 because it is the [5:29] value midway between 0 and 1.
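The hand calculation above can be sketched in Python (the video itself uses R). The bias weight of negative 5.754 is the value quoted later in the video. The input weight is never spoken in the transcript, so the value 2.752 below is an assumption, back-derived so that a PSA level of 2.0 gives the output of 0.438 that the video reports. The function and variable names are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

BIAS = -5.754   # bias weight quoted later in the video
W_PSA = 2.752   # ASSUMPTION: not stated in the transcript, back-derived from the 0.438 output

def predict(psa, cutoff=0.5):
    """Return (p_cancer, p_healthy, label) for a single PSA value."""
    z = BIAS + W_PSA * psa        # weighted sum: w0 + w1 * x1
    p_cancer = sigmoid(z)         # output node "cancer"
    p_healthy = sigmoid(-z)       # flipped curve, equal to 1 - p_cancer
    label = "cancer" if p_cancer > cutoff else "healthy"
    return p_cancer, p_healthy, label

p_cancer, p_healthy, label = predict(2.0)
print(round(p_cancer, 3), round(p_healthy, 3), label)  # 0.438 0.562 healthy
```

Under the same assumed weight, `predict(1.75)` gives roughly 0.281 for the cancer node and 0.719 for the healthy node, which agrees with the prediction example later in the video.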
[5:32] based on this cutoff value the network [5:34] will predict that the person with the [5:37] PSA level of 2 is healthy because the output [5:40] value associated with the healthy class [5:43] is greater than 0.5 [5:46] note that this value is exactly the same [5:49] value [5:50] as if we would draw a vertical line from [5:53] 2 to the logistic curve [5:56] and then a horizontal line from the [5:58] curve to the y-axis [6:00] the height of the activation function is [6:02] therefore equal to 0.438 [6:06] this value is then simply just 1 minus [6:10] 0.438 [6:13] this value is equal to the height of the [6:15] flipped curve [6:18] because the slope of the curve should be [6:20] negative if you go this way in the [6:22] network [6:24] if you have seen my previous videos [6:26] about logistic regression you will note [6:28] that such a model reproduces the exact same [6:31] results [6:33] the estimated parameters of the logistic [6:35] regression model [6:37] are identical to the weights in this [6:40] simple neural network [6:42] B0 or the intercept corresponds to the [6:46] bias [6:47] whereas B1 corresponds to the weight [6:50] associated with the output node for [6:52] prostate cancer [6:55] if we plug in this equation into the [6:57] activation function [7:00] we will obtain the exact same function [7:02] that is used in logistic regression [7:06] we can therefore conclude that this [7:07] simple neural network using the logistic [7:10] activation function [7:12] is identical to logistic regression [7:16] let's see how well this neural network [7:18] can predict the class of our training [7:21] data [7:22] this is the data that we used to train [7:25] the network and these are the values [7:27] generated by the output node for cancer [7:30] which can be obtained if you plug in [7:32] these values in the equation [7:36] and the weights [7:37] and then compute the corresponding [7:39] values with the activation function [7:42] for example this is the value that we
calculated by hand previously based on [7:47] the PSA level of 2.0 [7:50] which corresponds to the height of the [7:52] activation function [7:54] we can now use the output values and the [7:57] cutoff value 0.5 to see how well the [8:00] model predicts the training data [8:02] since these individuals had values [8:05] greater than 0.5 they are predicted to [8:08] have prostate cancer [8:09] since we know that these individuals [8:12] actually have prostate cancer we know [8:14] that the neural network has made the [8:16] correct predictions for these [8:18] individuals [8:19] however the network incorrectly predicts [8:22] that this cancer patient is healthy [8:25] these individuals have output values [8:28] that are less than 0.5 [8:30] which means that they are predicted to [8:32] be healthy since we know that these [8:35] individuals are healthy we know that [8:37] they have been correctly classified by [8:39] the network [8:41] this healthy person is incorrectly [8:43] classified as having prostate cancer [8:45] because this person has a relatively high PSA [8:48] level [8:49] in total the network makes 12 correct [8:52] predictions out of 14 possible [8:55] which gives an accuracy of about 86 [8:57] percent [8:59] however to get a fair estimate of how [9:02] well the network would perform on new [9:04] data [9:06] we should use a test data set or [9:08] cross-validation as we have done for all [9:11] other machine learning methods that we [9:13] have discussed so far [9:15] watch the video about validation to see [9:18] how to evaluate the classifier by the [9:20] holdout method or by cross-validation [9:23] once we have established a neural [9:25] network that has been trained on some [9:27] data we can use it for prediction [9:32] suppose that we like to test if this [9:34] person has prostate cancer or not [9:37] from a blood sample [9:39] we measure the PSA concentration to be 1.75 [9:43] nanograms per ml [9:45] we plug in the value in the network [9:48] we can
then compute the values of the [9:50] output nodes [9:52] we therefore plug in the value of the [9:54] input node here [9:57] the weight of the bias [10:00] and this weight here [10:03] and do the math [10:05] then we plug in this value in the [10:07] activation function [10:09] and do the math again [10:11] 1 minus 0.281 is 0.719 [10:17] since this value is larger than the [10:19] cutoff value of 0.5 the network predicts [10:23] that the person is healthy [10:25] we'll now try to understand where the [10:28] values for the weights come from [10:31] in this example these are the optimal [10:34] values for the weights [10:36] whereas these are the optimal values for [10:39] the bias weights [10:42] the weights are optimized by using some [10:45] sort of cost function [10:46] usually the maximum likelihood method [10:49] for binary classification that we have [10:51] covered in the videos about logistic [10:53] regression [10:54] the following weights were optimized by [10:57] minimizing the negative log likelihood function [11:01] we can also use the method of ordinary [11:03] least squares I will here explain the concept [11:06] of ordinary least squares because it is a bit [11:09] simpler to understand [11:12] in the videos about logistic regression [11:14] I show how to calculate the log [11:16] likelihood [11:17] and in another video I show the [11:20] difference between ordinary least squares and [11:22] the maximum likelihood method based on [11:24] the simple linear model [11:27] when we use the method of least squares [11:29] we try to minimize this function that [11:32] computes the sum of the squared errors [11:35] these are the Y values of the [11:37] observations which are in this example [11:40] equal to zero if it is a healthy person [11:45] and one if it is a patient with prostate [11:48] cancer [11:49] y hat is the value of the activation [11:52] function that can go between 0 and 1.
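The sum-of-squared-errors function shown on screen is not captured in the transcript. A reconstruction consistent with the spoken description is given below; the index i, the count N of observations, and the symbols w_0 and w_1 for the bias and input weight are assumed notation.

```latex
\mathrm{SSE}(w_0, w_1) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 ,
\qquad
\hat{y}_i = \frac{1}{1 + e^{-(w_0 + w_1 x_i)}}
```

Here y_i is 0 for a healthy person and 1 for a cancer patient, x_i is that person's PSA level, and N is 14 in this example.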
[11:55] this difference is called a residual or [11:59] an error which can be seen as the [12:01] distance between the observations and the [12:04] curve [12:05] the Y value of this data point is zero [12:09] and the value of the curve when the PSA [12:11] level is equal to 2.5 is 0.75 [12:16] which can be calculated like this based [12:19] on the weights we previously estimated [12:22] this results in a residual of negative [12:24] 0.75 [12:27] we then square this residual [12:30] we then do the same calculations for the [12:33] next data point and so forth [12:35] then we sum all the squared residuals [12:38] which here results in a value of 1.78 [12:43] suppose that we now change the value of [12:45] the bias from negative 5.754 [12:50] to negative 8 [12:53] that will move the curve a bit to the [12:56] right [12:57] which will increase the sum of squared [12:59] errors because the data points are now [13:02] much farther away from the curve [13:05] let's place a point here which [13:07] shows that the sum of squared [13:09] residuals or errors is 2.88 when the bias [13:13] is set to negative 8.
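The residual calculation, and the effect of moving the bias, can be sketched in Python (the video itself uses R). The transcript does not list the 14 simulated PSA values or the input weight, so the data below is made up for illustration and the weight 2.752 is an assumption. The printed sums will therefore not match the video's 1.78 or 2.88; only the pattern (a worse fit when the bias is moved away from negative 5.754) is meant to carry over.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def sse(bias, weight, xs, ys):
    """Sum of squared residuals between observed labels and the sigmoid curve."""
    return sum((y - sigmoid(bias + weight * x)) ** 2 for x, y in zip(xs, ys))

# Single data point from the video: healthy (y = 0), curve height 0.75 at PSA 2.5
residual = 0 - 0.75
squared_residual = residual ** 2   # 0.5625

# HYPOTHETICAL training data: 7 healthy (0) and 7 cancer (1), NOT the video's values
psa = [1.0, 1.3, 1.6, 1.8, 2.0, 2.2, 2.5, 1.7, 2.1, 2.3, 2.6, 2.8, 3.0, 3.3]
label = [0] * 7 + [1] * 7

WEIGHT = 2.752   # ASSUMPTION: input weight is not stated in the transcript

# Keep the weight fixed and compare a few bias values
for bias in (-8.0, -5.754, -3.0):
    print(bias, round(sse(bias, WEIGHT, psa, label), 2))
```

Evaluating `sse` over a fine grid of bias values would trace out an error curve of the kind the video draws point by point.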
[13:15] when the bias is set to negative 5.754 [13:20] the sum of the squared errors was equal [13:22] to 1.78 [13:25] if we set the bias to negative 3 [13:28] the curve will move a bit to the left [13:32] which results in a sum of [13:34] squared errors equal to 3.62 [13:39] if we would try many different values of [13:41] the bias weight [13:42] we will be able to generate a curve that [13:45] shows how the sum of the squared errors [13:48] changes when we change the value of the [13:51] weight associated with the bias [13:54] the method of least squares finds the [13:57] value of the weight that results in the [13:59] lowest possible sum of squared errors [14:03] this explains why the weight of the bias [14:05] is equal to about negative 5.8 in this [14:08] example because that value results in a [14:12] curve that is as close as possible to [14:14] the data points which means [14:16] that the network predicts prostate [14:18] cancer in the best way [14:22] to find the value of the weight that [14:24] results in the lowest sum of squared [14:25] errors [14:27] we need to start with an initial guess [14:30] suppose that we initially guess the [14:32] value of the weight to be negative three [14:35] then we use an algorithm such as [14:37] gradient descent where we step by step move [14:40] along the direction of the steepest [14:42] descent [14:43] like this [14:46] if we instead initially guess the value of [14:48] the bias to be negative 9
[14:52] the gradient descent method will instead [14:54] increase the value of the weight until [14:56] it finds the value that minimizes the [14:58] sum of the squared errors [15:01] one problem occurs when we have a more [15:04] complicated network because the error [15:06] function for such a network would then [15:09] have several local minima [15:12] this means that if we initially guess the [15:14] value of the bias to be negative 9 [15:17] the method will move down to the local [15:19] minimum and report negative 8 as the [15:22] best value [15:24] which will result in a bad fit to the [15:26] data [15:28] if we initially guess the value to be [15:30] negative 6.5 [15:32] or negative 3.5 [15:34] then we will reach the global minimum [15:39] which results in the best fit [15:42] it is therefore important that we try [15:44] many initial guesses for the weights [15:46] during the learning process to find a [15:48] network that can predict the data in the [15:51] best way [15:52] most software tools generate random [15:54] initial guesses for the weights [15:57] which means that you might get different [15:58] results every time you train your [16:01] network [16:02] we can make a similar error curve if we [16:05] change this weight [16:07] when we change this weight the curve [16:09] will change its slope [16:12] when we estimate two weights [16:13] simultaneously the error function will [16:16] correspond to a surface in a [16:19] three-dimensional plot like this [16:21] the combined values of the two weights [16:24] that result in the minimum value of this [16:26] function will result in the best fit to [16:28] the data [16:30] we'll now try to understand the purpose [16:32] of using a hidden layer in neural [16:34] networks [16:35] to show the beauty of the hidden layers [16:39] suppose that we have the following data [16:41] where one has measured a certain protein [16:44] in blood samples from 10 cancer patients [16:47] and five healthy individuals [16:50] people who are
healthy have an [16:52] intermediate level of this protein in [16:55] their blood [16:56] whereas people with prostate cancer either [16:58] have a low level [17:00] or a high level [17:02] if we would try to fit a logistic [17:04] regression model or train a neural network [17:07] with a logistic activation function [17:09] without a hidden layer [17:12] we may fail to make a good prediction [17:14] of the training data because an s-shaped [17:17] curve is simply not useful for this type [17:19] of data [17:21] these cancer patients will incorrectly [17:23] be predicted to be healthy by the model [17:27] and if you flip the curve these patients [17:29] will instead incorrectly be predicted to [17:31] be healthy [17:33] the curve might also look like this [17:37] where all the healthy individuals will [17:39] now be incorrectly predicted to have [17:41] prostate cancer [17:43] what we want is a curve that looks like [17:45] this [17:47] this is where the hidden layer comes in [17:49] because a neural network with one or [17:52] several hidden layers can generate cool [17:54] curves like this [17:56] we'll here extend our previous network [17:59] with one hidden layer [18:02] if we train this network [18:04] on the following data [18:07] we will obtain the following values for the [18:09] weights [18:10] note that if you try to train the neural [18:13] network on the same data you will end up [18:16] with different values compared to the [18:17] ones shown here because you will probably [18:20] end up in a different local minimum to [18:23] understand the calculations behind this [18:25] neural network let's see how it predicts [18:27] a person with a value of 3 for protein X [18:31] the value of this node [18:34] can be calculated like this [18:36] which results in a value that is [18:38] approximately equal to one [18:42] let's set the value of this node to 1.
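The hidden-node calculation, and the output-node step the video walks through next, can be sketched in Python (the video itself uses R). The trained weights appear only on screen and are not in the transcript, so every weight below is hypothetical, hand-picked to produce a curve that is high for low and high protein levels and low in between. The numbers will not match the video's values such as 0.869.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL weights (not the video's): one input, two hidden nodes, one cancer output
H1_BIAS, H1_W = -15.0, 10.0   # hidden node 1 switches on around x = 1.5
H2_BIAS, H2_W = -25.0, 10.0   # hidden node 2 switches on around x = 2.5
OUT_BIAS, OUT_W1, OUT_W2 = 5.0, -10.0, 10.0

def hidden(x):
    """Each hidden node applies the sigmoid to its own bias plus weight times input."""
    h1 = sigmoid(H1_BIAS + H1_W * x)
    h2 = sigmoid(H2_BIAS + H2_W * x)
    return h1, h2

def cancer_output(x):
    """Output node: sigmoid of bias plus the weighted sum of the hidden nodes."""
    h1, h2 = hidden(x)
    return sigmoid(OUT_BIAS + OUT_W1 * h1 + OUT_W2 * h2)

# Low and high protein levels give an output near 1, intermediate levels near 0
for x in (0.5, 2.0, 3.0):
    print(x, round(cancer_output(x), 3))
```

The design idea is that the two hidden sigmoids switch on at different input values, and the output node combines them with opposite signs, which is what lets the network produce a curve that a single s-shaped function cannot.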
[18:45] the corresponding value of this node is [18:47] 0.869 [18:51] the value of this output node is [18:53] calculated based on [18:55] the bias [18:57] and the sum of the weight multiplied by [19:00] the first node in the hidden layer [19:03] and the weight multiplied by the second [19:05] node in the hidden layer [19:07] which results in a value that is [19:09] approximately equal to one [19:13] similarly the value of the second output [19:16] node is approximately equal to zero [19:21] if the value of the input node is 3 [19:24] the network will predict that the person [19:26] has prostate cancer [19:29] let's focus on how this output node is [19:32] calculated [19:33] this is how this node was calculated [19:37] and this is how the values of the nodes [19:39] in the hidden layer were calculated [19:42] let's substitute X1 and x2 in the [19:45] equation [19:46] like this [19:49] and we plug in this equation in the [19:51] activation function here [19:54] so that we have the following function [19:56] that can compute the value of the output [19:58] node based on any input value [20:02] let's move the function up here [20:04] if you now calculate the value of the [20:06] function based on a range of different [20:08] values of X between 0 and 4
we will be [20:12] able to generate a curve with this shape [20:14] which predicts with 100 percent accuracy [20:18] because all individuals with prostate [20:20] cancer will get a value of approximately [20:22] one and will therefore be correctly [20:24] predicted to have prostate cancer [20:27] whereas all the healthy individuals will [20:29] get a predicted value of approximately [20:31] zero which means that they are [20:34] correctly predicted to be healthy [20:37] similarly the corresponding function of [20:40] this output node [20:41] will generate the opposite curve [20:44] this is the magical stuff behind neural [20:46] networks they can help us to generate [20:49] non-linear functions that can fit [20:51] almost any data which seems impossible to [20:53] do with standard statistical methods [20:57] we'll now discuss a few similarities and [20:59] differences between neural networks and [21:02] standard statistical models such as the [21:05] logistic regression model [21:07] the input in neural nets is usually [21:09] called a predictor or explanatory [21:12] variable in logistic regression [21:14] the output in neural nets is usually [21:17] called the response variable in logistic [21:19] regression [21:21] the weights in neural networks are [21:23] usually called coefficients or [21:25] parameters in logistic regression [21:28] the intercept in regression corresponds [21:31] to the bias [21:33] we usually say that we train the neural [21:36] network whereas the corresponding thing [21:38] in regression is that we fit the [21:40] model to the data or that we estimate [21:43] the parameters in the model [21:46] the parameters in regression are [21:48] estimated by minimizing some error [21:50] function such as the sum of squares in [21:53] regression [21:54] whereas the corresponding thing is [21:57] called backpropagation in neural nets [22:00] because we randomly assign the weights [22:02] with some numbers [22:04] then we see how well the
network [22:06] predicts the training data [22:09] then we go back to update the values [22:12] so that we improve the prediction [22:15] this is iterated until we reach some [22:17] convergence where the network can no [22:19] longer improve the predictions based on [22:21] the training data [22:24] the parameters in regression usually [22:27] have a meaning that we can interpret [22:29] such as the odds ratio in logistic [22:31] regression [22:32] whereas the values of the weights in the [22:34] neural network usually have no meaning [22:36] other than that they are used to make a good [22:39] prediction [22:40] in regression we usually compute P [22:42] values that are associated with the [22:44] parameters which requires that we fulfill [22:47] a set of assumptions [22:50] since we use the estimated parameters to [22:52] interpret the regression model we must [22:55] find the global minimum of the error [22:57] function [22:58] whereas this is not crucial in neural [23:01] networks because the network might [23:03] predict the outcome just fine even [23:05] though it has stopped at a local minimum [23:09] finally we'll have a look at some basic [23:11] R code if you like to reproduce the [23:13] numbers in this video [23:15] R is a free software tool that you can [23:18] download from the following website [23:20] we first plug in the data into an R [23:22] script [23:24] we will here use the neuralnet package which [23:27] you first need to install if you have [23:29] not already done that [23:31] then we load this package [23:34] and train the neural network [23:36] where we try to predict cancer versus [23:38] healthy based on the PSA level which is [23:42] the only input in this case [23:44] we plug in the training data here [23:47] since we in our example like to train a [23:49] neural network without a hidden layer we [23:52] set this argument to zero [23:55] and use the logistic activation function [23:58] the error function is here set to cross [24:01] entropy
which in this example is the [24:03] same thing as using the negative log [24:05] likelihood function to estimate the [24:07] weights [24:10] we can print the output [24:12] and plot the network [24:14] after [24:16] 754 iterations to optimize the values of [24:19] the weights in order to minimize the [24:21] negative log likelihood function [24:23] we have an error of 10.54 which in this [24:27] case corresponds to the negative log [24:29] likelihood value [24:30] the smaller this value is the better the [24:33] model fits the data [24:36] for example if we would like to use the [24:38] network for prediction we can use the [24:41] function predict [24:42] in this example the network will predict [24:45] that the person with a PSA level of 2 is [24:48] healthy [24:50] if you like to compare with the logistic [24:52] regression model you can run the [24:54] following line [24:56] if you like to change the error function [24:58] to the sum of squared errors you can set [25:01] this argument to SSE [25:03] the reason why we here use a lower [25:06] threshold for convergence compared to [25:08] the default value of 0.01 is simply to [25:11] get similar values as in logistic [25:13] regression but this will be more [25:15] computationally expensive [25:18] I also recommend that you run a number [25:20] of repetitions [25:22] in this example we will generate 10 [25:24] networks that have been based on [25:26] different initial values of the weights [25:30] then you can study the output of these [25:32] 10 networks and select the one with the [25:34] lowest error [25:37] in future videos we'll have a look at [25:40] networks where the output can result in [25:42] several categories compared to the [25:44] simple binary case we used in this video [25:48] we'll also have a look at networks where [25:50] the output is a continuous variable [25:53] and see how the number of hidden nodes [25:55] affects the output we will also discuss [25:57] the problem of overfitting [26:01] in another
video I will cover non-linear [26:03] regression which is highly related to [26:06] neural networks [26:07] this was the end of this basic video [26:09] about artificial neural networks thanks [26:12] for watching