
Artificial neural networks (ANN) - explained super simple

Transcribed Jun 16, 2026
Beginner 12 min read For: Students, beginners in machine learning, or anyone looking for a non-technical introduction to neural networks.
Views: 137.2K | Likes: 2.3K | Comments: 55 | Dislikes: 19 | 1.7% (Average)

AI Summary

This video provides a basic introduction to artificial neural networks (ANNs) using a simple example: predicting prostate cancer from PSA levels. It starts with a network without hidden layers, showing it is mathematically identical to logistic regression, then demonstrates how adding a hidden layer enables the network to learn complex, non-linear patterns. The video also covers weight optimization via gradient descent and includes R code for reproduction.

[0:05]
Basic Structure

A neural network consists of input nodes, a hidden layer, and output nodes. Example: three inputs (age, PSA concentration, MRI score) to predict prostate cancer.

[1:43]
Simple Network Example

A simple network with one input (PSA), no hidden layer, and two outputs (cancer/healthy) is trained on 14 individuals (7 cancer patients and 7 healthy individuals). It uses a sigmoid activation function.

[7:12]
Equivalence to Logistic Regression

The simple neural network with logistic activation function is identical to logistic regression. The bias corresponds to the intercept, and the weight to the coefficient.

[8:55]
Training Accuracy

The network achieved 86% accuracy (12/14 correct) on the training data, but cross-validation is recommended for fair evaluation.

[10:46]
Weight Optimization

Weights are optimized by minimizing an error function (e.g., sum of squared errors or negative log-likelihood) using gradient descent. The algorithm can get stuck in local minima, so multiple random starts are recommended.

[17:47]
Power of Hidden Layers

A hidden layer allows the network to learn non-linear curves (for example, a curve that is high at both ends and low in the middle) that can perfectly separate data where healthy individuals have intermediate values and cancer patients have low or high values.

[22:32]
Weights vs. Regression Coefficients

In neural networks, weights are usually not interpretable (unlike regression coefficients). The goal is prediction, not interpretation. Local minima may still yield good predictions.

[23:09]
R Code Example

R code using the 'neuralnet' package is provided to reproduce the examples, including training, prediction, and comparison with logistic regression.

Clickbait Check

95% Legit

"The title accurately describes the content: a basic, super simple explanation of artificial neural networks."

Tutorial Checklist

1 23:24 Install the neuralnet package in R if not already installed.
2 23:34 Load the neuralnet package.
3 23:36 Train the neural network using neuralnet(), specifying the formula (cancer ~ PSA), training data, hidden layers (0 for no hidden layer), activation function (logistic), and error function (cross-entropy or SSE).
4 24:10 Print the output and plot the network.
5 24:41 Use the predict() function to make predictions on new data.
6 24:52 Optionally, compare with logistic regression using glm().
7 25:01 To use sum of squared errors, set the error function argument to 'SSE'.
8 25:18 Run multiple repetitions (e.g., 10) with different initial weights and select the network with the lowest error.

Study Flashcards (10)

What are the three main components of a neural network?

easy

Input nodes, hidden layer(s), and output nodes.

0:05

What activation function is used in the simple neural network example?

easy

The sigmoid function.

2:23

What standard statistical model is a simple neural network (no hidden layer, sigmoid activation) identical to?

medium

Logistic regression.

7:12

In a neural network, what term corresponds to the intercept in logistic regression?

easy

The bias.

21:31

What accuracy did the simple neural network achieve on the training data?

medium

86% (12 out of 14 correct predictions).

8:55

Name two error functions used to train the neural network in the video.

medium

The sum of squared errors (SSE) and the negative log-likelihood.

10:46

What algorithm is used to find the optimal weights by moving step-by-step along the steepest descent?

medium

Gradient descent.

14:37

What is the key advantage of adding a hidden layer to a neural network?

hard

It lets the network generate non-linear curves that fit complex data patterns, such as healthy individuals at intermediate values and cancer patients at low or high values.

17:47

How do the weights in a neural network differ from coefficients in logistic regression in terms of interpretability?

hard

The weights in a neural network usually have no interpretable meaning; they are just used for prediction.

22:32

What is the process called in neural networks where weights are randomly assigned, predictions are checked, and then weights are updated to improve prediction?

hard

Back propagation.

21:57

💡 Key Takeaways

📊

Neural Network = Logistic Regression

Establishes a direct equivalence between a simple neural network and a standard statistical model, demystifying neural networks for those familiar with regression.

7:12
🔧

Gradient Descent Explained

Clearly explains the core optimization algorithm used to train neural networks, including the concept of local vs. global minima.

14:37
💡

Hidden Layers Enable Non-Linearity

Demonstrates the key advantage of hidden layers: they allow the network to learn complex, non-linear decision boundaries that simple models cannot.

17:47
⚖️

Weights Are Not Interpretable

Highlights a fundamental difference between neural networks and regression models: neural network weights are tools for prediction, not for interpretation.

22:32

[00:00] welcome to this basic video about

[00:02] artificial neural networks

[00:05] a neural network consists of input nodes

[00:09] a hidden layer

[00:11] and output nodes

[00:14] for example these input nodes may

[00:16] represent three measurements that we do

[00:19] to determine if someone has prostate

[00:22] cancer or not

[00:24] suppose that the person enters the

[00:27] hospital

[00:28] we check the age

[00:30] and the concentration of the prostate

[00:32] specific antigen from a blood sample

[00:35] and collect a score between 1 and 5 from

[00:38] an MRI scan of the prostate

[00:41] then we plug in the corresponding values

[00:43] in the input nodes

[00:46] and use the network to tell if the

[00:48] person has prostate cancer or not

[00:51] to understand how neural Nets work we

[00:54] will here have a look at the super

[00:56] simple example

[00:58] at the end of this video we'll try to

[01:00] understand what happens if you include a

[01:02] hidden layer

[01:04] and have a look at some simple R code if you like

[01:06] to reproduce the first example shown in

[01:08] this video

[01:11] we will here see how to train a neural

[01:13] network to predict if someone has

[01:15] prostate cancer based on the PSA level

[01:18] our training data consists of seven

[01:21] patients that we know have prostate

[01:23] cancer

[01:24] and seven individuals that we know are

[01:27] healthy based on a blood sample

[01:30] we have determined the concentration of

[01:32] PSA or the prostate specific antigen in

[01:35] the blood of these individuals

[01:38] note that this data has been simulated

[01:40] for the purpose of this video

[01:43] we'll here use the simplest possible

[01:45] neural network model

[01:48] since we only have one measured variable

[01:50] the PSA concentration the network will

[01:53] only have just one input node

[01:56] note that we here do not have any hidden

[01:58] layer

[01:59] because we like to use the simplest

[02:01] possible Network

[02:03] the network has two output nodes because

[02:06] we here like to use the network to

[02:08] predict if someone has prostate cancer

[02:10] or not

[02:12] this is called a bias which is used to

[02:15] modify the activation function

[02:18] there are several different activation

[02:20] functions to select between

[02:23] we will here use the sigmoid function

[02:25] which is the same function that is used

[02:27] in logistic regression

[02:29] this will later allow us to compare the

[02:32] neural network with logistic regression

[02:35] let's make a plot of this data

[02:39] like this

[02:41] where these points represent the cancer

[02:44] patients that are here coded as ones

[02:48] whereas these points represent the

[02:50] healthy individuals that are coded as

[02:53] zeros

[02:54] for example this healthy individual has

[02:57] a PSA level of 2.5

[03:01] whereas this cancer patient has a PSA

[03:03] level of 2.1

[03:07] training a neural network means that we

[03:10] find the optimal values of the weights

[03:14] and the weight associated with the

[03:15] bias

[03:17] so that the activation function in this

[03:19] case

[03:21] generates a sigmoid curve that can be

[03:24] used to predict the outcome in an

[03:25] optimal way

[03:27] note that the logistic activation

[03:29] function generates a value between 0 and

[03:32] 1.

[03:34] and that e is Euler's number

[03:37] if we train this simple network we will

[03:40] obtain the following weights

[03:43] let's use this network to see if this

[03:46] healthy person is correctly predicted as

[03:49] being healthy by the network

[03:52] we plug in its PSA level of 2.0 in the

[03:55] input node

[03:57] to calculate the probability that the

[03:59] person has cancer

[04:02] we use the weights associated with

[04:04] output node

[04:06] in this equation

[04:08] where w0 is the bias weight

[04:11] and this is the weight for the input signal

[04:14] whereas this is the value of the input

[04:16] node

[04:18] since we only have one weight that is

[04:20] associated with the output node n is

[04:23] here equal to 1.

[04:25] let's plug in the weights in the

[04:27] equation

[04:29] and the value of the input node

[04:32] and do the math

[04:34] we now plug in this value in the

[04:37] logistic activation function

[04:39] and calculate

[04:41] this value is the value for the output

[04:44] node cancer

[04:45] which can in this case be seen as the

[04:47] probability that the person has prostate

[04:50] cancer

[04:52] let's place this value here

[04:55] we'll now calculate the value of this

[04:57] output node

[04:59] where we plug in the corresponding

[05:01] values in the equation

[05:03] and do the math

[05:06] let's place the corresponding value here

[05:08] which can be seen as the probability

[05:10] that the person is healthy

[05:13] we can now use some kind of threshold to

[05:16] determine if we should classify the

[05:17] individual as having prostate cancer or

[05:19] not based on these two values

[05:23] a common cutoff value to use in binary

[05:25] classification is 0.5 because it is the

[05:29] value midway between 0 and 1

[05:32] based on this cutoff value the network

[05:34] will predict that the person with the

[05:37] PSA level of 2 is healthy because the output

[05:40] value associated with the healthy class

[05:43] is greater than 0.5
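
The hand calculation above can be reproduced in a few lines. This is a minimal Python sketch (the video itself works in R), not the presenter's code. The bias of -5.754 is quoted later in the transcript; the input weight of 2.752 is never spoken, so it is an assumption back-calculated from the quoted output of 0.438 at a PSA level of 2.0. The function name `predict` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

BIAS = -5.754   # bias weight quoted in the transcript
WEIGHT = 2.752  # ASSUMPTION: back-calculated from the quoted output 0.438, not stated in the video

def predict(psa, cutoff=0.5):
    """Return (p_cancer, p_healthy, label) for one PSA value."""
    z = BIAS + WEIGHT * psa      # bias plus weight times the input node
    p_cancer = sigmoid(z)        # value of the output node "cancer"
    p_healthy = sigmoid(-z)      # flipped curve, equal to 1 - p_cancer
    label = "cancer" if p_cancer > cutoff else "healthy"
    return p_cancer, p_healthy, label

p_cancer, p_healthy, label = predict(2.0)
print(round(p_cancer, 3), round(p_healthy, 3), label)  # 0.438 0.562 healthy
```

With the same assumed weight, `predict(1.75)` gives a cancer value of about 0.281, which matches the new-patient example quoted at 10:11.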

[05:46] note that this value is exactly the same

[05:49] value

[05:50] as if we would draw a vertical line from

[05:53] 2 to the logistic curve

[05:56] and then the horizontal line from the

[05:58] curve to the y-axis

[06:00] the height of the activation function is

[06:02] therefore equal to 0.438

[06:06] this value is then simply just 1 minus

[06:10] 0.438

[06:13] this value is equal to the height of the

[06:15] flipped curve

[06:18] because the slope of the curve should be

[06:20] negative if you go this way in the

[06:22] network

[06:24] if you have seen my previous videos

[06:26] about logistic regression you will note

[06:28] that such a model produces the exact same

[06:31] results

[06:33] the estimated parameters of the logistic

[06:35] regression model

[06:37] are identical to the weights in this

[06:40] simple neural network

[06:42] b0 or the intercept corresponds to the

[06:46] bias

[06:47] whereas b1 corresponds to the weight

[06:50] associated with the output node for

[06:52] prostate cancer

[06:55] if we plug in this equation into the

[06:57] activation function

[07:00] we will obtain the exact same function

[07:02] that is used in logistic regression

[07:06] we can therefore conclude that this

[07:07] simple neural network using the logistic

[07:10] activation function

[07:12] is identical to logistic regression
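
One way to check this equivalence is to evaluate both formulas side by side. The Python sketch below compares the network's activation, 1 / (1 + e^-(w0 + w1*x)), with logistic regression written in its odds form, e^(b0 + b1*x) / (1 + e^(b0 + b1*x)). The function names are hypothetical, and the slope of 2.752 is an assumed value back-calculated from outputs quoted in the transcript; only the intercept of -5.754 is stated in the video.

```python
import math

def nn_output(x, w0, w1):
    """Network without a hidden layer: sigmoid of bias + weight * input."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def logistic_regression(x, b0, b1):
    """Logistic regression written in its odds form."""
    odds = math.exp(b0 + b1 * x)
    return odds / (1.0 + odds)

b0, b1 = -5.754, 2.752  # intercept from the transcript; slope is an ASSUMED value

for psa in [1.0, 1.75, 2.0, 2.5, 3.0]:
    a = nn_output(psa, w0=b0, w1=b1)
    b = logistic_regression(psa, b0, b1)
    print(psa, round(a, 3), round(b, 3))  # the two columns agree
```

The two expressions are algebraically the same function, so any pair of values used for the bias and weight gives matching columns.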

[07:16] let's see how well this neural network

[07:18] can predict the class of our training

[07:21] data

[07:22] this is the data that we used to train

[07:25] the network and these are the values

[07:27] generated by the output node for cancer

[07:30] which can be obtained if you plug in

[07:32] these values in the equation

[07:36] and the weights

[07:37] and then compute the corresponding

[07:39] values by the activation function

[07:42] for example this is the value that we

[07:45] calculated by hand previously based on

[07:47] the PSA level of 2.0

[07:50] which corresponds to the height of the

[07:52] activation function

[07:54] we can now use the output values and the

[07:57] cutoff value of 0.5 to see how well the

[08:00] model predicts the training data

[08:02] since these individuals had values

[08:05] greater than 0.5 they are predicted to

[08:08] have prostate cancer

[08:09] since we know that these individuals

[08:12] actually have prostate cancer we know

[08:14] that the neural network has made the

[08:16] correct predictions for these

[08:18] individuals

[08:19] however the network incorrectly predicts

[08:22] that this cancer patient is healthy

[08:25] these individuals have output values

[08:28] that are less than 0.5

[08:30] which means that they are predicted to

[08:32] be healthy since we know that these

[08:35] individuals are healthy we know that

[08:37] they have been correctly classified by

[08:39] the network

[08:41] this healthy person is incorrectly

[08:43] classified as having prostate cancer

[08:45] because it has a relatively high PSA

[08:48] level

[08:49] in total the network makes 12 correct

[08:52] predictions out of 14 possible

[08:55] which gives an accuracy of about 86

[08:57] percent
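
The accuracy figure follows directly from the counts described above. A small Python sketch, using only numbers stated in the transcript (seven cancer patients, seven healthy individuals, one misclassification in each group); the variable names are illustrative.

```python
# Counts taken from the transcript: 7 cancer patients and 7 healthy individuals,
# with one cancer patient and one healthy person misclassified.
true_positives = 6   # cancer patients predicted to have cancer
false_negatives = 1  # cancer patient predicted to be healthy
true_negatives = 6   # healthy individuals predicted to be healthy
false_positives = 1  # healthy person predicted to have cancer

total = true_positives + false_negatives + true_negatives + false_positives
accuracy = (true_positives + true_negatives) / total
print(f"{accuracy:.3f}")  # 0.857
```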

[08:59] however to get a fair estimate of how

[09:02] well the network would perform on new

[09:04] data

[09:06] we should use a test data set or

[09:08] cross-validation as we have done in all

[09:11] other machine learning methods that we

[09:13] have discussed so far

[09:15] watch the video about validation to see

[09:18] how to evaluate the classifier by the

[09:20] holdout method or by cross-validation

[09:23] once we have established a neural

[09:25] network that has been trained on some

[09:27] data we can use it for prediction

[09:32] suppose that we like to test if this

[09:34] person has prostate cancer or not

[09:37] from a blood sample

[09:39] we measure the PSA concentration to be 1.75

[09:43] nanograms per ml

[09:45] we plug in the value in the network

[09:48] we can then compute the values of the

[09:50] output nodes

[09:52] we therefore plug in the value of the

[09:54] input node here

[09:57] the weight of the bias

[10:00] and this weight here

[10:03] and do the math

[10:05] then we plug in this value in the

[10:07] activation function

[10:09] and do the math again

[10:11] 1 minus 0.281 is 0.719

[10:17] since this value is larger than the

[10:19] cutoff value of 0.5 the network predicts

[10:23] that the person is healthy

[10:25] we'll now try to understand where the

[10:28] values for the weights come from

[10:31] in this example these are the optimal

[10:34] values for the weights

[10:36] whereas these are the optimal values for

[10:39] the bias weights

[10:42] the weights are optimized by using some

[10:45] sort of cost function

[10:46] usually the maximum likelihood method

[10:49] for binary classification that we have

[10:51] covered in the videos about logistic

[10:53] regression

[10:54] the following weights were optimized by

[10:57] the negative log likelihood function
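
For reference, the negative log-likelihood for a binary outcome can be written as a short function. This Python sketch is illustrative only: it uses two healthy individuals with cancer probabilities quoted elsewhere in the transcript (0.438 and 0.75), because the full list of 14 observations is not given. The function name is hypothetical.

```python
import math

def negative_log_likelihood(y_true, p_pred):
    """Sum of -log(probability assigned to the observed class).

    Each p must lie strictly between 0 and 1, otherwise log() fails.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Two healthy individuals (y = 0) with the cancer probabilities quoted in the transcript
y_true = [0, 0]
p_pred = [0.438, 0.75]
print(negative_log_likelihood(y_true, p_pred))
```

A lower value means the predicted probabilities agree better with the observed classes, which is why training tries to minimize it.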

[11:01] we can also use the method of ordinary

[11:03] least squares I will here explain the concept

[11:06] of ordinary least squares because it is a bit

[11:09] simpler to understand

[11:12] in the videos about logistic regression

[11:14] I show how to calculate the log

[11:16] likelihood

[11:17] and in another video I show the

[11:20] difference between ordinary least squares and

[11:22] the maximum likelihood method based on

[11:24] the simple linear model

[11:27] when we use the method of least squares

[11:29] we try to minimize this function that

[11:32] computes the sum of the squared errors

[11:35] these are the Y values of the

[11:37] observations which are in this example

[11:40] equal to zero if it is a healthy person

[11:45] and one if it is a patient with prostate

[11:48] cancer

[11:49] y hat is the value of the activation

[11:52] function that can go between 0 and 1.

[11:55] this difference is called a residual or

[11:59] an error which can be seen as the

[12:01] distance between observations and the

[12:04] Curve

[12:05] the Y value of this data point is zero

[12:09] and the value of the curve when the PSA

[12:11] level is equal to 2.5 is 0.75

[12:16] which can be calculated like this based

[12:19] on the weights we previously estimated

[12:22] this results in a residual of negative

[12:24] 0.75

[12:27] we then Square this residual

[12:30] we then do the same calculations for the

[12:33] next data point and so forth

[12:35] then we sum all the squared residuals

[12:38] which here results in a value of 1.78
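
The residual arithmetic for the single data point described above can be sketched in Python as follows. Only that one observation (a healthy person, y = 0, with a curve height of 0.75) comes from the transcript; the total of 1.78 cannot be reproduced here because the other 13 observations are not listed. The function names are hypothetical.

```python
def squared_residual(y, y_hat):
    """Squared distance between an observation and the curve."""
    residual = y - y_hat
    return residual ** 2

def sum_of_squared_errors(y_true, y_pred):
    """Add up the squared residuals over all observations."""
    return sum(squared_residual(y, p) for y, p in zip(y_true, y_pred))

# Healthy person (y = 0) with PSA 2.5, where the curve has a height of 0.75
print(0 - 0.75)                   # residual: -0.75
print(squared_residual(0, 0.75))  # 0.5625
```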

[12:43] suppose that we now change the value of

[12:45] the bias from negative 5.754

[12:50] to negative 8.

[12:53] that will move the curve a bit to the

[12:56] right

[12:57] which will increase the sum of squared

[12:59] errors because the data points are now

[13:02] much farther away from the Curve

[13:05] let's place a point here which

[13:07] represents that the sum of squared

[13:09] residuals or errors is 2.88 and the bias

[13:13] is set to negative 8.

[13:15] when the bias is set to negative 5.754

[13:20] the sum of the squared errors was equal

[13:22] to 1.78

[13:25] if we set the bias to negative 3

[13:28] we will move the curve a bit to the left

[13:32] which will result in that the sum of the

[13:34] squared errors is equal to 3.62

[13:39] if we would try many different values of

[13:41] the bias weight

[13:42] we will be able to generate a curve that

[13:45] shows how the sum of the squared errors

[13:48] changes when we change the value of the

[13:51] weight associated with the bias

[13:54] the method of least squares finds the

[13:57] value of the weight that results in the

[13:59] lowest possible sum of squared errors

[14:03] this explains why the weight of the bias

[14:05] is equal to about negative 5.8 in this

[14:08] example because that value results in a

[14:12] curve that is as close as possible to

[14:14] the data points which will result in

[14:16] that the network predicts the prostate

[14:18] cancer in the best way

[14:22] to find the value of the weight that

[14:24] results in the lowest sum of squared

[14:25] errors

[14:27] we need to start with an initial guess

[14:30] suppose that we initially guess the

[14:32] value of the weight to be negative three

[14:35] then we use an algorithm such as the

[14:37] gradient descent where we step by step move

[14:40] along the direction of the steepest

[14:42] descent

[14:43] like this

[14:46] if we instead initially guess the value of

[14:48] the bias to negative 9.

[14:52] the gradient descent method will instead

[14:54] increase the value of the weight until

[14:56] it finds the value that minimizes the

[14:58] sum of the squared errors
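
The idea of stepping downhill on the error curve can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not the video's computation: the 14 PSA values are hypothetical (the video's simulated data is not listed in the transcript), the slope of 2.752 is an assumed value held fixed so that only the bias is optimized, and the learning rate and step count are arbitrary choices.

```python
import math

# HYPOTHETICAL data: the video's simulated PSA values are not listed in the transcript
healthy_psa = [1.0, 1.2, 1.5, 1.7, 1.9, 2.0, 2.5]
cancer_psa = [1.8, 2.1, 2.3, 2.6, 2.8, 3.0, 3.3]
data = [(x, 0) for x in healthy_psa] + [(x, 1) for x in cancer_psa]

WEIGHT = 2.752  # ASSUMED slope, held fixed so that only the bias is optimized

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sse(bias):
    """Sum of squared errors for a given bias."""
    return sum((y - sigmoid(bias + WEIGHT * x)) ** 2 for x, y in data)

def sse_gradient(bias):
    """Derivative of the SSE with respect to the bias."""
    grad = 0.0
    for x, y in data:
        p = sigmoid(bias + WEIGHT * x)
        grad += -2.0 * (y - p) * p * (1.0 - p)
    return grad

def gradient_descent(start, learning_rate=0.1, steps=2000):
    """Repeatedly step against the gradient, i.e. downhill on the error curve."""
    bias = start
    for _ in range(steps):
        bias -= learning_rate * sse_gradient(bias)
    return bias

for start in (-3.0, -9.0):
    end = gradient_descent(start)
    print(f"start {start}: bias {end:.3f}, SSE {sse(start):.3f} -> {sse(end):.3f}")
```

Starting at -3 the bias is decreased, and starting at -9 it is increased, which mirrors the two initial guesses described in the transcript.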

[15:01] one problem occurs when we have a more

[15:04] complicated Network because the error

[15:06] function for such a network would then

[15:09] have several local minima

[15:12] this means that if we initially guess the

[15:14] value of the bias to negative 9

[15:17] the method will move down to the local

[15:19] minimum and report negative 8 as the

[15:22] best value

[15:24] which will result in a bad fit to the

[15:26] data

[15:28] if you initially guess the value to

[15:30] negative 6.5

[15:32] or negative 3.5

[15:34] then we will reach the global minimum

[15:39] which results in the best fit

[15:42] it is therefore important that we try

[15:44] many initial guesses of the weights

[15:46] during the learning process to find a

[15:48] network that can predict the data in the

[15:51] best way
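
The effect of the starting point can be demonstrated on a toy error curve. The Python sketch below uses a hypothetical polynomial, w^4 - 3w^2 + w, chosen only because it has one local and one global minimum; it is not the error function of any network in the video. Fixed starting points stand in for the random initial guesses that software would draw.

```python
def error(w):
    """HYPOTHETICAL error curve with one local and one global minimum."""
    return w**4 - 3 * w**2 + w

def error_gradient(w):
    """Derivative of the toy error curve."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(start, learning_rate=0.01, steps=5000):
    """Step downhill from the starting point until the steps are used up."""
    w = start
    for _ in range(steps):
        w -= learning_rate * error_gradient(w)
    return w

# Fixed starting points stand in for random initial guesses
starts = [-2.0, -0.5, 0.5, 2.0]
results = [gradient_descent(s) for s in starts]

for s, w in zip(starts, results):
    print(f"start {s:+.1f} -> w = {w:+.3f}, error = {error(w):.3f}")

# Keep the run with the lowest error, as recommended in the video
best = min(results, key=error)
print("best w:", round(best, 3))
```

Runs that start on the right-hand side settle in the shallower local minimum near w = 1.13, while runs that start on the left reach the global minimum near w = -1.30. Selecting the lowest-error run recovers the better solution.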

[15:52] most software tools generate random

[15:54] initial guesses of the weights

[15:57] which means that you might get different

[15:58] results every time you train your

[16:01] network

[16:02] we can make a similar error curve if we

[16:05] change this weight

[16:07] when we change this weight the Curve

[16:09] will change its slope

[16:12] when we estimate two weights

[16:13] simultaneously the error function will

[16:16] correspond to a surface in a

[16:19] three-dimensional plot like this

[16:21] the combined values of the two weights

[16:24] that result in the minimum value of this

[16:26] function will result in the best fit to

[16:28] the data

[16:30] we'll now try to understand the purpose

[16:32] of using a hidden layer in neural

[16:34] networks

[16:35] to show the beauty of the Hidden layers

[16:39] suppose that we have the following data

[16:41] where one has measured a certain protein

[16:44] in blood samples from 10 cancer patients

[16:47] and five healthy individuals

[16:50] people who are healthy have an

[16:52] intermediate level of this protein in

[16:55] their blood

[16:56] whereas people with prostate cancer either

[16:58] have a low level

[17:00] or a high level

[17:02] if we would try to fit a logistic

[17:04] regression model or train a neural network

[17:07] with a logistic activation function

[17:09] without a hidden layer

[17:12] we would fail to make a good prediction

[17:14] of the training data because an s-shaped

[17:17] curve is simply not useful for this type

[17:19] of data

[17:21] these cancer patients will incorrectly

[17:23] be predicted to be healthy by the model

[17:27] and if you flip the curve these patients

[17:29] will instead incorrectly be predicted to

[17:31] be healthy

[17:33] the curve might also look like this

[17:37] where all the healthy individuals will

[17:39] now be incorrectly predicted to have

[17:41] prostate cancer

[17:43] what we want is a curve that looks like

[17:45] this

[17:47] this is where the hidden layer comes in

[17:49] because a neural network with one or

[17:52] several hidden layers can generate cool

[17:54] curves like this

[17:56] we'll here extend our previous network

[17:59] with one hidden layer

[18:02] if you train this network

[18:04] on the following data

[18:07] we will obtain the following values of the

[18:09] weights

[18:10] note that if you try to compute the neural

[18:13] network on the same data you will end up

[18:16] with different values compared to the

[18:17] one shown here because you will probably

[18:20] end up on a different local minimum to

[18:23] understand the calculations behind this

[18:25] neural network let's see how it predicts

[18:27] a person with a value of 3 of protein X

[18:31] the value of this node

[18:34] can be calculated like this

[18:36] which results in a value that is

[18:38] approximately equal to one

[18:42] let's set the value of this node to 1.

[18:45] the corresponding value of this node is

[18:47] 0.869

[18:51] the value of this output node is

[18:53] calculated based on

[18:55] the bias

[18:57] and the sum of the weight multiplied by

[19:00] the first node in the hidden layer

[19:03] and the weight multiplied by the second

[19:05] node in the hidden layer

[19:07] which results in a value that is

[19:09] approximately equal to one

[19:13] similarly the value of the second output

[19:16] node is approximately equal to zero

[19:21] if the value of the input node is 3.

[19:24] the network will predict that the person

[19:26] has prostate cancer

[19:29] let's focus on how this output node is

[19:32] calculated

[19:33] this is how this node was calculated

[19:37] and this is how the values of the nodes

[19:39] in a hidden layer were calculated

[19:42] let's substitute X1 and x2 in the

[19:45] equation

[19:46] like this

[19:49] and we plug in this equation in

[19:51] activation function here

[19:54] so that we have the following function

[19:56] that can compute the value of the output

[19:58] node based on any input value

[20:02] let's move the function up here

[20:04] if you now calculate the value of the

[20:06] function based on a range of different

[20:08] values of X between 0 and 4. we will be

[20:12] able to generate a curve with this shape

[20:14] which predicts with 100 percent accuracy

[20:18] because all individuals with prostate

[20:20] cancer will get a value of approximately

[20:22] one and will therefore be correctly

[20:24] predicted to have prostate cancer

[20:27] whereas all the healthy individuals will

[20:29] get a predicted value of approximately

[20:31] zero which will result in that they are

[20:34] correctly predicted to be healthy
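
The composed function described above can be sketched in Python. The weights here are hypothetical, hand-picked to produce a curve that is high at both ends and low in the middle; they are not the trained values shown in the video, which are not spoken in the transcript. The structure (one input, two hidden nodes, a logistic activation at each node) follows the description.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL hand-picked weights (not the trained values shown in the video)
def hidden_layer(x):
    h1 = sigmoid(-15.0 + 10.0 * x)  # switches on around x = 1.5
    h2 = sigmoid(-25.0 + 10.0 * x)  # switches on around x = 2.5
    return h1, h2

def p_cancer(x):
    """Output node: bias + weight * first hidden node + weight * second hidden node."""
    h1, h2 = hidden_layer(x)
    return sigmoid(5.0 - 10.0 * h1 + 10.0 * h2)

# Evaluate the function over a range of protein levels between 0 and 4
for x in [0.5, 1.0, 2.0, 3.0, 3.5]:
    print(x, round(p_cancer(x), 3))
```

Low and high protein levels give values close to one, and intermediate levels give values close to zero, which a single S-shaped curve cannot do.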

[20:37] similarly the corresponding function of

[20:40] this output node

[20:41] will generate the opposite curve

[20:44] this is the magical stuff behind neural

[20:46] networks they can help us to generate

[20:49] non-linear functions that can fit to

[20:51] almost any data that seems impossible to

[20:53] do with standard statistical methods

[20:57] we'll now discuss a few similarities and

[20:59] differences between neural networks and

[21:02] standard statistical models such as the

[21:05] logistic regression model

[21:07] the input in neural Nets is usually

[21:09] called a predictor or explanatory

[21:12] variable in logistic regression

[21:14] the output in neural Nets is usually

[21:17] called the response variable in logistic

[21:19] regression

[21:21] the weights in neural networks are

[21:23] usually called coefficients or

[21:25] parameters in logistic regression

[21:28] The Intercept in regression corresponds

[21:31] to the bias

[21:33] we usually say that we train the neural

[21:36] network whereas the corresponding thing

[21:38] in regression is that we fit the

[21:40] model to the data or that we estimate

[21:43] the parameters in the model

[21:46] the parameters in regression are

[21:48] estimated by minimizing some error

[21:50] function such as the sum of squares in

[21:53] regression

[21:54] whereas the corresponding thing is

[21:57] called back propagation in neural Nets

[22:00] because we randomly assign the weights

[22:02] with some numbers

[22:04] then we see how well the network

[22:06] predicts the training data

[22:09] then we go back to update the values

[22:12] so that we improve the prediction

[22:15] this is iterated until we reach some

[22:17] convergence where the network can no

[22:19] longer improve the predictions based on

[22:21] the training data
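
The loop described above (random initial weights, predict, update, repeat until convergence) can be sketched in Python for the network without a hidden layer, where the update reduces to plain gradient descent on the negative log-likelihood. The training data is hypothetical, and the learning rate, tolerance, seed, and the name `train` are illustrative assumptions rather than anything taken from the video.

```python
import math
import random

# HYPOTHETICAL training data: PSA values with labels 0 = healthy, 1 = cancer
data = [(1.0, 0), (1.2, 0), (1.5, 0), (1.7, 0), (1.9, 0), (2.0, 0), (2.5, 0),
        (1.8, 1), (2.1, 1), (2.3, 1), (2.6, 1), (2.8, 1), (3.0, 1), (3.3, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(bias, weight):
    """Negative log-likelihood of the training data."""
    total = 0.0
    for x, y in data:
        p = sigmoid(bias + weight * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def train(seed=0, learning_rate=0.01, tolerance=1e-9, max_steps=50000):
    rng = random.Random(seed)
    # 1. assign random numbers to the weights
    bias, weight = rng.uniform(-1, 1), rng.uniform(-1, 1)
    previous = loss(bias, weight)
    current = previous
    for _ in range(max_steps):
        # 2. see how well the network predicts the training data
        errors = [(sigmoid(bias + weight * x) - y, x) for x, y in data]
        # 3. go back and update the weights against the gradient
        bias -= learning_rate * sum(e for e, _ in errors)
        weight -= learning_rate * sum(e * x for e, x in errors)
        current = loss(bias, weight)
        # 4. stop when the predictions no longer improve
        if previous - current < tolerance:
            break
        previous = current
    return bias, weight, current

bias, weight, final_loss = train()
print(f"bias {bias:.3f}, weight {weight:.3f}, loss {final_loss:.3f}")
```

Changing `seed` gives different starting weights. For this single-layer case the loss has one minimum, so every start should end in about the same place; with hidden layers that is no longer guaranteed.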

[22:24] the parameters in regression usually

[22:27] have a meaning that we can interpret

[22:29] such as the odds ratio in logistic

[22:31] regression

[22:32] whereas the values of the weights in the

[22:34] neural network usually have no meaning

[22:36] other than that they are used to make a good

[22:39] prediction

[22:40] in regression we usually compute P

[22:42] values that are associated with the

[22:44] parameters which requires that we fulfill

[22:47] a set of assumptions

[22:50] since we use the estimated parameters to

[22:52] interpret the regression model we must

[22:55] find the global minimum of the error

[22:57] function

[22:58] whereas this is not crucial in neural

[23:01] networks because the network might

[23:03] predict the outcome just fine even

[23:05] though it has stopped on a local minimum

[23:09] finally we'll have a look at some basic

[23:11] R code if you like to reproduce the

[23:13] numbers in this video

[23:15] R is a free software tool that you can

[23:18] download from the following website

[23:20] we first plug in the data into an R

[23:22] script

[23:24] we will here use the neuralnet package which

[23:27] you first need to install if you have

[23:29] not already done that

[23:31] then we load this package

[23:34] and train the neural network

[23:36] where we try to predict cancer versus

[23:38] healthy based on the PSA level which is

[23:42] the only input in this case

[23:44] we plug in the training data here

[23:47] since we in our example like to train a

[23:49] neural network without a hidden layer we

[23:52] set this argument to zero

[23:55] and use the logistic activation function

[23:58] the error function is here set to cross

[24:01] entropy which in this example is the

[24:03] same thing as using the negative log

[24:05] likelihood function to estimate the

[24:07] weights

[24:10] we can print the output

[24:12] and plot the network

[24:14] after

[24:16] 754 iterations to optimize the values of

[24:19] the weights in order to minimize the

[24:21] negative log likelihood function

[24:23] we have an error of 10.54 which in this

[24:27] case corresponds to the negative log

[24:29] likelihood value

[24:30] the smaller this value is the better the

[24:33] model fits the data

[24:36] for example if you like to use the

[24:38] network for prediction we can use the

[24:41] function predict

[24:42] in this example the network will predict

[24:45] that the person with a PSA level of 2 is

[24:48] healthy

[24:50] if you like to compare with the logistic

[24:52] regression model you can run the

[24:54] following line

[24:56] if you like to change the error function

[24:58] to the sum of squared errors you can set

[25:01] this argument to SSE

[25:03] the reason why I here use a lower

[25:06] threshold for convergence compared to

[25:08] the default value of 0.01 is simply to

[25:11] get similar values as in logistic

[25:13] regression but this will be more

[25:15] computationally expensive

[25:18] I also recommend that you run a number

[25:20] of repetitions

[25:22] in this example we will generate 10

[25:24] networks that have been based on

[25:26] different initial values of the weights

[25:30] then you can study the output of these

[25:32] 10 networks and select the one with the

[25:34] lowest error

[25:37] in future videos we'll have a look at

[25:40] networks where the output can result in

[25:42] several categories compared to the

[25:44] simple binary case we used in this video

[25:48] we'll also have a look at networks where

[25:50] the output is a continuous variable

[25:53] and see how the number of hidden nodes

[25:55] affects the output we will also discuss

[25:57] the problem of overfitting

[26:01] in another video I will cover non-linear

[26:03] regression which is highly related to

[26:06] neural networks

[26:07] this was the end of this basic video

[26:09] about artificial neural networks thanks

[26:12] for watching
