---
title: 'Artificial neural networks (ANN) - explained super simple'
source: 'https://youtube.com/watch?v=XxZ0BibMTjw'
video_id: 'XxZ0BibMTjw'
date: 2026-07-28
duration_sec: 1574
---

# Artificial neural networks (ANN) - explained super simple

> Source: [Artificial neural networks (ANN) - explained super simple](https://youtube.com/watch?v=XxZ0BibMTjw)

## Summary

This video provides a basic introduction to artificial neural networks (ANNs) using a simple example: predicting prostate cancer from PSA levels. It starts with a network without hidden layers, showing it is mathematically identical to logistic regression, then demonstrates how adding a hidden layer enables the network to learn complex, non-linear patterns. The video also covers weight optimization via gradient descent and includes R code for reproduction.

### Key Points

- **Basic Structure** [0:05] — A neural network consists of input nodes, a hidden layer, and output nodes. Example: three inputs (age, PSA concentration, MRI score) to predict prostate cancer.
- **Simple Network Example** [1:43] — A simple network with one input (PSA), no hidden layer, and two outputs (cancer/healthy) is trained on 14 patients. It uses a sigmoid activation function.
- **Equivalence to Logistic Regression** [7:12] — The simple neural network with logistic activation function is identical to logistic regression. The bias corresponds to the intercept, and the weight to the coefficient.
- **Training Accuracy** [8:55] — The network achieved 86% accuracy (12/14 correct) on the training data, but cross-validation is recommended for fair evaluation.
- **Weight Optimization** [10:46] — Weights are optimized by minimizing an error function (e.g., sum of squared errors or negative log-likelihood) using gradient descent. The algorithm can get stuck in local minima, so multiple random starts are recommended.
- **Power of Hidden Layers** [17:47] — A hidden layer allows the network to learn non-linear curves (e.g., an 'M-shaped' curve) that can perfectly separate data where healthy individuals have intermediate values and cancer patients have low or high values.
- **Weights vs. Regression Coefficients** [22:32] — In neural networks, weights are usually not interpretable (unlike regression coefficients). The goal is prediction, not interpretation. Local minima may still yield good predictions.
- **R Code Example** [23:09] — R code using the 'neuralnet' package is provided to reproduce the examples, including training, prediction, and comparison with logistic regression.

## Transcript

welcome to this basic video about
artificial neural networks
a neural network consists of input nodes
a hidden layer
and output nodes
for example these input nodes may
represent three measurements that we do
to determine if someone has prostate
cancer or not
suppose that the person enters the
hospital
we check the age
and the concentration of the prostate
specific antigen from a blood sample
and collect a score between 1 and 5 from
an MRI scan of the prostate
then we plug in the corresponding values
in the input nodes
and use the network to tell if the
person has prostate cancer or not
to understand how neural nets work we
will here have a look at a super
simple example
at the end of this video we'll try to
understand what happens if you include a
hidden layer
and look at some simple R code if you like
to reproduce the first example shown in
this video
we will here see how to train a neural
network to predict if someone has
prostate cancer based on the PSA level
our training data consists of seven
patients that we know have prostate
cancer
and seven individuals that we know are
healthy based on a blood sample
we have determined the concentration of
PSA or the prostate specific antigen in
the blood of these individuals
note that this data has been simulated
for the purpose of this video
we'll here use the simplest possible
neural network model
since we only have one measured variable
the PSA concentration the network will
only have just one input node
note that we here do not have any hidden
layer
because we like to use the simplest
possible network
the network has two output nodes because
we here like to use the network to
predict if someone has prostate cancer
or not
this is called a bias which is used to
modify the activation function
there are several different activation
functions to select between
we will here use the sigmoid function
which is the same function that is used
in logistic regression
this will later allow us to compare the
neural network with logistic regression
let's make a plot of this data
like this
where these points represent the cancer
patients that are here coded as ones
whereas these points represent the
healthy individuals that are coded as
zeros
for example this healthy individual has
a PSA level of 2.5
whereas this cancer patient has a PSA
level of 2.1
training a neural network means that we
find the optimal values of the weights
and the weight associated with the
bias
so that the activation function in this
case
generates a sigmoid curve that can be
used to predict the outcome in an
optimal way
note that the logistic activation
function generates a value between 0 and
1
and that e is Euler's number
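
For reference, the logistic (sigmoid) activation described here is usually written as below. The symbol $z$ for the weighted input to a node is notation assumed for this note, not taken from the video.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad 0 < \sigma(z) < 1
$$
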
if we train this simple network we will
obtain the following weights
let's use this network to see if this
healthy person is correctly predicted as
being healthy by the network
we plug in its PSA level of 2.0 in the
input node
to calculate the probability that the
person has cancer
we use the weights associated with
output node
in this equation
where w0 is the bias weight
and this is the weight for the input signal
whereas this is the value of the input
node
since we only have one weight that is
associated with the output node n is
here equal to 1.
let's plug in the weights in the
equation
and the value of the input node
and do the math
we now plug in this value in the
logistic activation function
and calculate
this value is the value for the output
node cancer
which can in this case be seen as the
probability that the person has prostate
cancer
let's place this value here
we'll now calculate the value of this
output node
where we plug in the corresponding
values in the equation
and do the math
let's place the corresponding value here
which can be seen as the probability
that the person is healthy
we can now use some kind of threshold to
determine if we should classify the
individual to have prostate cancer or
not based on these two values
a common cutoff value to use in binary
classification is 0.5 because it is the
middle value between 0 and 1
based on this cutoff value the network
will predict that the person with the
PSA level of 2 is healthy because the output
value associated with the healthy class
is greater than 0.5
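
The hand calculation above can be sketched in a few lines of Python. The bias weight of -5.754 is the value quoted later in the video. The input weight is never stated in the transcript, so `W1_PSA = 2.752` is an assumption, back-calculated so that a PSA level of 2.0 gives an output of about 0.438. The function name `predict_psa` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

W0_BIAS = -5.754   # bias weight quoted in the video
W1_PSA = 2.752     # ASSUMPTION: not stated in the transcript, back-calculated

def predict_psa(psa, cutoff=0.5):
    """Return (p_cancer, p_healthy, label) for a single PSA value."""
    z = W0_BIAS + W1_PSA * psa      # weighted sum feeding the cancer node
    p_cancer = sigmoid(z)
    p_healthy = sigmoid(-z)         # flipped curve, equal to 1 - p_cancer
    label = "cancer" if p_cancer > cutoff else "healthy"
    return p_cancer, p_healthy, label

p_cancer, p_healthy, label = predict_psa(2.0)
print(round(p_cancer, 3), round(p_healthy, 3), label)
```

Under these assumed weights the two outputs sum to one, which is why the healthy node can be read as one minus the cancer node.
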
note that this value is exactly the same
value
as if we would draw a vertical line from
2 to the logistic curve
and then the horizontal line from the
curve to the y-axis
the height of the activation function is
therefore equal to 0.438
this value is then simply just 1 minus
0.438
this value is equal to the height of the
flipped curve
because the slope of the curve should be
negative if you go this way in the
network
if you have seen my previous videos
about logistic regression you will note
that such a model reproduces the exact same
results
the estimated parameters of the logistic
regression model
are identical to the weights in this
simple neural network
b0 or the intercept corresponds to the
bias
whereas b1 corresponds to the weight
associated with the output node for
prostate cancer
if we plug in this equation into the
activation function
we will obtain the exact same function
that is used in logistic regression
we can therefore conclude that this
simple neural network using the logistic
activation function
is identical to logistic regression
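
Written out, substituting the weighted sum into the logistic activation gives the logistic regression curve. The notation is assumed for this note: $w_0$ is the bias weight, $w_1$ the input weight, $x$ the PSA level, and $b_0$, $b_1$ the regression intercept and coefficient.

$$
\hat{y} = \sigma(w_0 + w_1 x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}} = \frac{1}{1 + e^{-(b_0 + b_1 x)}} \quad \text{with } b_0 = w_0,\; b_1 = w_1
$$
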
let's see how well this neural network
can predict the class of our training
data
this is the data that we used to train
the network and these are the values
generated by the output node for cancer
which can be obtained if you plug in
these values in the equation
and the weights
and then compute the corresponding
values by the activation function
for example this is the value that we
calculated by hand previously based on
the PSA level of 2.0
which corresponds to the height of the
activation function
we can now use the output values and the
cutoff value of 0.5 to see how well the
model predicts the training data
since these individuals had values
greater than 0.5 they are predicted to
have prostate cancer
since we know that these individuals
actually have prostate cancer we know
that the neural network has made the
correct predictions for these
individuals
however the network incorrectly predicts
that this cancer patient is healthy
these individuals have output values
that are less than 0.5
which means that they are predicted to
be healthy since we know that these
individuals are healthy we know that
they have been correctly classified by
the network
this healthy person is incorrectly
classified as having prostate cancer
because it has a relatively high PSA
level
in total the network makes 12 correct
predictions out of 14 possible
which gives an accuracy of about 86
percent
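
The accuracy figure is simply the share of thresholded outputs that match the known labels. The sketch below assumes made-up output values (the video's actual 14 outputs are not in the transcript), arranged so that one healthy person and one patient fall on the wrong side of 0.5.

```python
def accuracy(outputs, labels, cutoff=0.5):
    """Fraction of cases where thresholding the output matches the true label."""
    predicted = [1 if p > cutoff else 0 for p in outputs]
    correct = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == y)
    return correct / len(labels)

# HYPOTHETICAL cancer-node outputs for 14 people, for illustration only.
labels = [0] * 7 + [1] * 7                              # 7 healthy, 7 cancer
outputs = [0.05, 0.10, 0.18, 0.28, 0.35, 0.44, 0.75,    # one healthy person above 0.5
           0.40, 0.62, 0.70, 0.81, 0.90, 0.95, 0.98]    # one patient below 0.5

print(f"{accuracy(outputs, labels):.3f}")
```
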
however to get a fair estimate of how
well the network would perform on new
data
we should use a test data set or
cross-validation as we have done in all
other machine learning methods that we
have discussed so far
watch the video about validation to see
how to evaluate the classifier by the
holdout method or by cross-validation
once we have established a neural
network that has been trained on some
data we can use it for prediction
suppose that we like to test if this
person has prostate cancer or not
from a blood sample
we measure the PSA concentration to 1.75
nanograms per ml
we plug in the value in the network
we can then compute the values of the
output nodes
we therefore plug in the value of the
input node here
the weight of the bias
and this weight here
and do the math
then we plug in this value in the
activation function
and do the math again
1 minus 0.281 is 0.719
since this value is larger than the
cutoff value of 0.5 the network predicts
that the person is healthy
we'll now try to understand where the
values for the weights come from
in this example these are the optimal
values for the weights
whereas these are the optimal values for
the bias weights
the weights are optimized by using some
sort of cost function
usually the maximum likelihood method
for binary classification that we have
covered in the videos about logistic
regression
the following weights were optimized by
the negative log likelihood function
we can also use the method of ordinary
least squares I will here explain the concept
of ordinary least squares because it is a bit
simpler to understand
in the videos about logistic regression
I show how to calculate the log
likelihood
and in another video I show the
difference between ordinary least squares and
the maximum likelihood method based on
the simple linear model
when we use the method of least squares
we try to minimize this function that
computes the sum of the squared errors
these are the Y values of the
observations which are in this example
equal to zero if it is a healthy person
and one if it is a patient with prostate
cancer
y hat is the value of the activation
function that can go between 0 and 1
this difference is called a residual or
an error which can be seen as the
distance between observations and the
curve
the Y value of this data point is zero
and the value of the curve when the PSA
level is equal to 2.5 is 0.75
which can be calculated like this based
on the weights we previously estimated
this results in a residual of negative
0.75
we then square this residual
we then do the same calculations for the
next data point and so forth
then we sum all the squared residuals
which here results in a value of 1.78
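
As a small sketch of this calculation: the first part reproduces the single residual from the transcript (a healthy person, coded 0, where the curve has height 0.75). The four-point data set that follows is hypothetical and only shows the summation, since the video's 14 data points are not listed in the transcript.

```python
def sum_squared_errors(y_true, y_hat):
    """Sum of squared residuals between observed labels and curve heights."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_hat))

# The single point discussed in the transcript.
residual = 0 - 0.75
print(residual, residual ** 2)

# HYPOTHETICAL labels and curve heights, for illustration only.
y_true = [0, 0, 1, 1]
y_hat = [0.1, 0.75, 0.4, 0.9]
print(round(sum_squared_errors(y_true, y_hat), 4))
```
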
suppose that we now change the value of
the bias from negative 5.754
to negative 8.
that will move the curve a bit to the
right
which will increase the sum of squared
errors because the data points are now
much farther away from the curve
let's place a point here which
represents that the sum of squared
residuals or errors is 2.88 and the bias
is set to negative 8.
when the bias is set to negative 5.754
the sum of the squared errors was equal
to 1.78
if we set the bias to negative 3
that will move the curve a bit to the left
which results in a sum of
squared errors equal to 3.62
if we would try many different values of
the bias weight
we will be able to generate a curve that
shows how the sum of the squared errors
changes when we change the value of the
weight associated with the bias
the method of least squares finds the
value of the weight that results in the
lowest possible sum of squared errors
this explains why the weight of the bias
is equal to about negative 5.8 in this
example because that value results in a
curve that is as close as possible to
the data points which means
that the network predicts prostate
cancer in the best way
to find the value of the weight that
results in the lowest sum of squared
errors
we need to start with an initial guess
suppose that we initially guess the
value of the weight to be negative three
then we use an algorithm such as
gradient descent where we step by step move
along the direction of the steepest
descent
like this
if we instead initially guess the value of
the bias to be negative 9
the gradient descent method will instead
increase the value of the weight until
it finds the value that minimizes the
sum of the squared errors
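
The idea can be sketched as a one-dimensional gradient descent on the bias weight alone. Everything specific here is an assumption: the 14 PSA values are made up (the video's data are not in the transcript), the input weight is held fixed at an assumed 2.752, and the learning rate and step count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL PSA values (not the video's data): 7 healthy (0), 7 cancer (1).
PSA = [1.0, 1.3, 1.6, 1.8, 2.0, 2.2, 2.5, 1.9, 2.3, 2.6, 2.8, 3.0, 3.3, 3.6]
Y = [0] * 7 + [1] * 7
W1 = 2.752  # ASSUMPTION: input weight held fixed while the bias is optimized

def sse(w0):
    """Sum of squared errors as a function of the bias weight only."""
    return sum((y - sigmoid(w0 + W1 * x)) ** 2 for x, y in zip(PSA, Y))

def sse_gradient(w0):
    """Derivative of sse() with respect to the bias weight."""
    total = 0.0
    for x, y in zip(PSA, Y):
        p = sigmoid(w0 + W1 * x)
        total += -2.0 * (y - p) * p * (1.0 - p)
    return total

def gradient_descent(w0, learning_rate=0.1, steps=2000):
    """Repeatedly step against the gradient, starting from the guess w0."""
    for _ in range(steps):
        w0 -= learning_rate * sse_gradient(w0)
    return w0

from_minus_3 = gradient_descent(-3.0)   # the bias is decreased step by step
from_minus_9 = gradient_descent(-9.0)   # the bias is increased step by step
print(round(from_minus_3, 3), round(from_minus_9, 3))
```

On this simple error curve both starting guesses end at the same bias value, because there is only one minimum to find.
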
one problem occurs when we have a more
complicated network because the error
function for such a network would then
have several local minima
this means that if we initially guess the
value of the bias to be negative 9
the method will move down to the local
minimum and report negative 8 as the
best value
which will result in a bad fit to the
data
if we initially guess the value to be
negative 6.5
or negative 3.5
then we will reach the global minimum
which results in the best fit
it is therefore important that we try
many initial guesses of the weights
during the learning process to find a
network that can predict the data in the
best way
most software tools generate random
initial guesses of the weights
which means that you might get different
results every time you train your
network
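
A minimal sketch of why several starting guesses help. The error curve below is a hypothetical polynomial, chosen only because it has a local minimum near -7.8 and a global minimum near -4.9; it is not the error function of any real network. Real tools draw the starting guesses at random, but the three guesses from the transcript are used here so the result is reproducible.

```python
def error(w):
    """HYPOTHETICAL error curve with a local and a global minimum."""
    return 0.1 * (w + 8) ** 2 * (w + 5) ** 2 - 0.3 * (w + 8)

def error_gradient(w):
    """Analytical derivative of error()."""
    return 0.2 * (w + 8) * (w + 5) * (2 * w + 13) - 0.3

def gradient_descent(w, learning_rate=0.01, steps=5000):
    """Step against the gradient from the initial guess w."""
    for _ in range(steps):
        w -= learning_rate * error_gradient(w)
    return w

initial_guesses = [-9.0, -6.5, -3.5]
fits = [gradient_descent(w) for w in initial_guesses]
best = min(fits, key=error)   # keep the fit with the lowest error

for start, w in zip(initial_guesses, fits):
    print(f"start {start:5.1f} -> w = {w:.2f}, error = {error(w):.3f}")
print(f"selected w = {best:.2f}")
```

The start at -9 stops in the shallow local minimum, while the other two reach the deeper one, so selecting the lowest error across runs recovers the better fit.
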
we can make a similar error curve if we
change this weight
when we change this weight the curve
will change its slope
when we estimate two weights
simultaneously the error function will
correspond to a surface in a
three-dimensional plot like this
the combined values of the two weights
that result in the minimum value of this
function will result in the best fit to
the data
we'll now try to understand the purpose
of using a hidden layer in neural
networks
to show the beauty of hidden layers
suppose that we have the following data
where one has measured a certain protein
in blood samples from 10 cancer patients
and five healthy individuals
people who are healthy have an
intermediate level of this protein in
their blood
whereas people with prostate cancer either
have a low level
or a high level
if we would try to fit a logistic
regression model or train a neural network
with a logistic activation function
without a hidden layer
we would fail to make a good prediction
of the training data because an s-shaped
curve is simply not useful for this type
of data
these cancer patients will incorrectly
be predicted to be healthy by the model
and if you flip the curve these patients
will instead incorrectly be predicted to
be healthy
the curve might also look like this
where all the healthy individuals will
now be incorrectly predicted to have
prostate cancer
what we want is a curve that looks like
this
this is where the hidden layer comes in
because a neural network with one or
several hidden layers can generate cool
curves like this
we'll here extend our previous network
with one hidden layer
if you train this network
on the following data
we will obtain the following values of the
weights
note that if you try to compute the neural
network on the same data you will end up
with different values compared to the
ones shown here because you will probably
end up on a different local minimum
to understand the calculations behind this
neural network let's see how it predicts
a person with a value of 3 of protein X
the value of this node
can be calculated like this
which results in a value that is
approximately equal to one
let's set the value of this node to 1.
the corresponding value of this node is
0.869
the value of this output node is
calculated based on
the bias
and the sum of the weight multiplied by
the first node in the hidden layer
and the weight multiplied by the second
node in the hidden layer
which results in a value that is
approximately equal to one
similarly the value of the second output
node is approximately equal to zero
if the value of the input node is 3.
the network will predict that the person
has prostate cancer
let's focus on how this output node is
calculated
this is how this node was calculated
and this is how the values of the nodes
in a hidden layer were calculated
let's substitute X1 and x2 in the
equation
like this
and we plug in this equation in
the activation function here
so that we have the following function
that can compute the value of the output
node based on any input value
let's move the function up here
if you now calculate the value of the
function based on a range of different
values of X between 0 and 4 we will be
able to generate a curve with this shape
which predicts with 100 percent accuracy
because all individuals with prostate
cancer will get a value of approximately
one and will therefore be correctly
predicted to have prostate cancer
whereas all the healthy individuals will
get a predicted value of approximately
zero which will result in that they are
correctly predicted to be healthy
similarly the corresponding function of
this output node
will generate the opposite curve
this is the magical stuff behind neural
networks they can help us to generate
non-linear functions that can fit to
almost any data that seems impossible to
do with standard statistical methods
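
To make the mechanism concrete, here is a sketch of a forward pass through a network with one input, two hidden nodes and a cancer output node. All weights are hypothetical and hand-picked so that the output is high for low and high protein levels and low in between; they are not the weights shown in the video, which the transcript does not list.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL weights, chosen by hand for illustration.
HIDDEN = [(-15.0, 10.0),   # (bias, weight): node switches on above x = 1.5
          (-25.0, 10.0)]   # (bias, weight): node switches on above x = 2.5
OUT_BIAS = 5.0
OUT_WEIGHTS = [-10.0, 10.0]

def cancer_output(x):
    """Forward pass: input -> two hidden nodes -> cancer output node."""
    hidden = [sigmoid(b + w * x) for b, w in HIDDEN]
    z = OUT_BIAS + sum(w * h for w, h in zip(OUT_WEIGHTS, hidden))
    return sigmoid(z)

for x in [0.5, 1.0, 2.0, 3.0, 3.5]:
    print(x, round(cancer_output(x), 3))
```

The first hidden node pulls the output down once the protein level passes 1.5, and the second pushes it back up past 2.5. Combining two S-shaped curves this way gives a shape that a single logistic curve cannot produce.
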
we'll now discuss a few similarities and
differences between neural networks and
standard statistical models such as the
logistic regression model
the input in neural nets is usually
called a predictor or explanatory
variable in logistic regression
the output in neural nets is usually
called the response variable in logistic
regression
the weights in neural networks are
usually called coefficients or
parameters in logistic regression
the intercept in regression corresponds
to the bias
we usually say that we train the neural
network whereas the corresponding thing
in regression is that we fit the
model to the data or that we estimate
the parameters in the model
the parameters in regression are
estimated by minimizing some error
function such as the sum of squares in
regression
whereas the corresponding thing is
called backpropagation in neural nets
because we randomly assign the weights
with some numbers
then we see how well the network
predicts the training data
then we go back to update the values
so that we improve the prediction
this is iterated until we reach some
convergence where the network can no
longer improve the predictions based on
the training data
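
The loop described here (random start, check the predictions, update, repeat until convergence) can be sketched for the network without a hidden layer. This is plain gradient descent on the negative log-likelihood, not the algorithm of any particular package; with no hidden layer there is nothing to propagate back through, so it shows only the update-and-iterate idea. The data, learning rate and convergence threshold are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL, overlapping data (not the video's): 7 healthy (0), 7 cancer (1).
PSA = [1.0, 1.3, 1.6, 1.8, 2.0, 2.2, 2.5, 1.9, 2.3, 2.6, 2.8, 3.0, 3.3, 3.6]
Y = [0] * 7 + [1] * 7

def negative_log_likelihood(w0, w1):
    total = 0.0
    for x, y in zip(PSA, Y):
        p = sigmoid(w0 + w1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def train(seed=0, learning_rate=0.05, threshold=1e-4, max_steps=200_000):
    rng = random.Random(seed)
    w0, w1 = rng.uniform(-1, 1), rng.uniform(-1, 1)    # random initial weights
    for step in range(max_steps):
        # Forward pass, then the gradient of the error for each weight.
        errors = [sigmoid(w0 + w1 * x) - y for x, y in zip(PSA, Y)]
        g0 = sum(errors)
        g1 = sum(e * x for e, x in zip(errors, PSA))
        if max(abs(g0), abs(g1)) < threshold:          # converged
            break
        w0 -= learning_rate * g0                       # go back and update
        w1 -= learning_rate * g1
    return w0, w1, step

w0, w1, steps = train()
print(round(w0, 3), round(w1, 3), steps, round(negative_log_likelihood(w0, w1), 3))
```
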
the parameters in regression usually
have a meaning that we can interpret
such as the odds ratio in logistic
regression
whereas the values of the weights in the
neural network usually have no meaning
other than that they're used to make a good
prediction
in regression we usually compute P
values that are associated with the
parameters which requires that we fulfill
a set of assumptions
since we use the estimated parameters to
interpret the regression model we must
find the global minimum of the error
function
whereas this is not crucial in neural
networks because the network might
predict the outcome just fine even
though it has stopped on a local minimum
finally we'll have a look at some basic
R code if you like to reproduce the
numbers in this video
R is a free software tool that you can
download from the following website
we first plug in the data into an R
script
we will here use the neuralnet package which
you first need to install if you have
not already done that
then we load this package
and train the neural network
where we try to predict cancer versus
healthy based on the PSA level which is
the only input in this case
we plug in the training data here
since we in our example like to train a
neural network without a hidden layer we
set this argument to zero
and use the logistic activation function
the error function is here set to cross
entropy which in this example is the
same thing as using the negative log
likelihood function to estimate the
weights
we can print the output
and plot the network
after
754 iterations to optimize the values of
the weights in order to minimize the
negative log likelihood function
we have an error of 10.54 which in this
case corresponds to the negative log
likelihood value
the smaller this value is the better the
model fits the data
for example if you like to use the
network for prediction we can use the
function predict
in this example the network will predict
that the person with a PSA level of 2 is
healthy
if you like to compare with the logistic
regression model you can run the
following line
if you like to change the error function
to the sum of squared errors you can set
this argument to SSE
the reason why I here use a lower
threshold for convergence compared to
the default value of 0.01 is simply to
get similar values as in logistic
regression but this will be more
computationally expensive
I also recommend that you run a number
of repetitions
in this example we will generate 10
networks that have been based on
different initial values of the weights
then you can study the output of these
10 networks and select the one with the
lowest error
in future videos we'll have a look at
networks where the output can result in
several categories compared to the
simple binary case we used in this video
we'll also have a look at networks where
the output is a continuous variable
and see how the number of hidden nodes
affects the output we will also discuss
the problem of overfitting
in another video I will cover non-linear
regression which is highly related to
neural networks
this was the end of this basic video
about artificial neural networks thanks
for watching