Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)

1h 18m video | Published Apr 17, 2020 | Transcribed Jul 28, 2026 | Stanford Online

Intermediate | 16 min read | For: students or professionals with basic calculus and linear algebra knowledge who are new to machine learning.

AI Trust Score 95/100

✅ Highly Legit

"The title accurately describes the lecture content, which covers linear regression, gradient descent, and the normal equation."

AI Summary

This lecture introduces linear regression, a foundational supervised learning algorithm. It covers the hypothesis representation, the cost function, and two methods for minimizing it: gradient descent (both batch and stochastic) and the normal equation. The lecture also establishes key notation and concepts used throughout the course.

Chapters

1 Introduction to Linear Regression 00:03

2 Notation and Hypothesis Representation 04:43

3 The Cost Function 12:51

4 Gradient Descent Algorithm 18:04

5 Batch vs. Stochastic Gradient Descent 33:34

6 The Normal Equation 42:00

7 Derivation and Conclusion 54:42

[01:22]

Regression vs. Classification

In supervised learning, a regression problem predicts a continuous output (for example, a house price), whereas a classification problem predicts a discrete one.

[05:00]

Hypothesis Representation

The hypothesis is h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n. Defining x_0 = 1 lets it be written compactly as h_theta(x) = sum_{j=0}^{n} theta_j*x_j = theta^T x.
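As a minimal sketch of the formula above, the hypothesis is a dot product once x_0 = 1 is prepended to the feature vector. The function name `hypothesis` and the parameter values are illustrative assumptions, not taken from the lecture.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta_0*x_0 + ... + theta_n*x_n, with x_0 = 1 prepended."""
    features = [1.0] + list(x)  # x_0 = 1 is the intercept term
    if len(theta) != len(features):
        raise ValueError("theta must have n + 1 entries for n features")
    return sum(t * f for t, f in zip(theta, features))


# Hypothetical parameters: base price 50, plus 0.1 per square foot, plus 20 per bedroom.
theta = [50.0, 0.1, 20.0]
house = [2000.0, 3.0]  # size in square feet, number of bedrooms
prediction = hypothesis(theta, house)
print(prediction)
```

Prepending x_0 inside the function keeps the caller's feature list at length n while theta has length n + 1.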

[08:56]

Notation: m, n, and x^{(i)}

m = number of training examples, n = number of features. x^{(i)} denotes the i-th training example.
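The notation maps onto plain Python lists roughly as follows. The data values are made up for illustration, and the variable names (`inputs`, `prices`, `with_intercept`) are assumptions. Note that the lecture's superscript (i) counts from 1, while Python indexes from 0.

```python
# Hypothetical housing data: each row is one training example (size in square feet, bedrooms).
inputs = [
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 4.0],
]
prices = [400.0, 330.0, 369.0]  # made-up targets y^(i), in thousands of dollars

m = len(inputs)      # number of training examples
n = len(inputs[0])   # number of features

# The lecture's x^(i) is 1-indexed; Python lists are 0-indexed.
x_2 = inputs[2 - 1]  # x^(2), the second training example
y_2 = prices[2 - 1]  # y^(2), its target

# Prepending x_0 = 1 to every row gives vectors of length n + 1.
with_intercept = [[1.0] + row for row in inputs]
print(m, n, x_2, y_2, with_intercept[0])
```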

[16:28]

Cost Function J(theta)

J(theta) = 1/2 * sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2. The 1/2 simplifies derivative calculations.
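A short sketch of this cost function, assuming each input row already starts with x_0 = 1. The toy data is invented so that the exact fit is known in advance.

```python
def cost(theta, X, y):
    """J(theta) = 1/2 * sum over i of (h_theta(x^(i)) - y^(i))^2.

    Each row of X is assumed to already start with x_0 = 1.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(t * f for t, f in zip(theta, x_i))
        total += (prediction - y_i) ** 2
    return 0.5 * total


# Toy data (made up): y = 1 + 2*x exactly.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

perfect_fit = cost([1.0, 2.0], X, y)  # every residual is zero
zero_theta = cost([0.0, 0.0], X, y)   # residuals are -1, -3, -5
print(perfect_fit, zero_theta)
```

With theta = [0, 0] the squared residuals are 1, 9, and 25, so J = 0.5 * 35 = 17.5, while the exact parameters give J = 0.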

[23:03]

Gradient Descent Update Rule

Gradient descent iteratively updates every parameter theta_j (j = 0, ..., n) at the same time: theta_j := theta_j - alpha * (partial derivative of J w.r.t. theta_j), where alpha is the learning rate.
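For the squared-error cost J(theta) = 1/2 * sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2, the partial derivative in the update rule can be worked out with the chain rule. The following is a sketch of that derivation, using k as a dummy summation index:

```latex
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \, \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
  &= \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
     \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{n} \theta_k x_k^{(i)} - y^{(i)} \right) \\
  &= \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{align*}
```

The factor of 2 from differentiating the square cancels the 1/2, which is why the 1/2 is included in J. Presentations that define the cost as an average over the m examples pick up an extra 1/m factor, which only rescales alpha.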

[41:10]

Batch vs. Stochastic Gradient Descent

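Batch gradient descent sums over all m examples for every update, so each step requires a full pass over the data. Stochastic gradient descent updates the parameters after each single example, so it starts making progress immediately and is usually much faster on large datasets, although its path is noisier and it tends to oscillate near the minimum rather than settle exactly on it.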
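The two variants can be sketched side by side. This is an illustrative implementation under stated assumptions: the toy data is noise-free and made up, the learning rate `alpha = 0.05` and the iteration counts were chosen by hand for this data, and the helper names (`predict`, `batch_step`, `sgd_step`) are hypothetical.

```python
def predict(theta, x):
    """h_theta(x) for a feature vector x that already starts with x_0 = 1."""
    return sum(t * f for t, f in zip(theta, x))


def batch_step(theta, X, y, alpha):
    """One batch update: the gradient sums over all m examples."""
    errors = [predict(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
    return [
        t_j - alpha * sum(e * x_i[j] for e, x_i in zip(errors, X))
        for j, t_j in enumerate(theta)
    ]


def sgd_step(theta, x_i, y_i, alpha):
    """One stochastic update: the gradient uses a single example."""
    error = predict(theta, x_i) - y_i
    return [t_j - alpha * error * x_ij for t_j, x_ij in zip(theta, x_i)]


# Toy data (made up): y = 1 + 2*x exactly; each row starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
alpha = 0.05  # assumed learning rate, chosen by hand for this data

theta_batch = [0.0, 0.0]
for _ in range(2000):  # 2000 updates, each touching all m examples
    theta_batch = batch_step(theta_batch, X, y, alpha)

theta_sgd = [0.0, 0.0]
for _ in range(2000):  # 2000 passes, each making m single-example updates
    for x_i, y_i in zip(X, y):
        theta_sgd = sgd_step(theta_sgd, x_i, y_i, alpha)

print(theta_batch, theta_sgd)
```

Both runs should approach theta = [1, 2] here because the data is exactly linear. On noisy data the stochastic version would typically keep wandering around the minimum unless alpha is decreased over time.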

[54:42]

The Normal Equation

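The normal equation provides a closed-form solution: theta = (X^T X)^{-1} X^T y, where X is the design matrix whose rows are the training inputs x^{(i)} (each including x_0 = 1) and y is the vector of targets. It requires X^T X to be invertible.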
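A dependency-free sketch of the closed-form solution follows. Rather than forming the inverse explicitly, it solves the equivalent linear system (X^T X) theta = X^T y by Gaussian elimination with partial pivoting; that substitution, the function name `normal_equation`, the singularity threshold, and the toy data are all assumptions made for illustration.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting."""
    m, k = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y], built entry by entry.
    A = [
        [sum(X[i][r] * X[i][c] for i in range(m)) for c in range(k)]
        + [sum(X[i][r] * y[i] for i in range(m))]
        for r in range(k)
    ]
    for col in range(k):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:  # assumed tolerance for "singular"
            raise ValueError("X^T X is singular; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution on the upper-triangular system.
    theta = [0.0] * k
    for r in range(k - 1, -1, -1):
        known = sum(A[r][c] * theta[c] for c in range(r + 1, k))
        theta[r] = (A[r][k] - known) / A[r][r]
    return theta


# Toy design matrix (made up): one row per example, first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
print(theta)
```

Because the toy targets satisfy y = 1 + 2*x exactly, the result should be close to [1, 2]. In practice a numerical library's least-squares routine would usually be preferred over hand-written elimination.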

Mentioned in this Video

CS229 Lecture Notes (link)

Craigslist (service)

ALVINN (link)

Tutorial Checklist

1 02:01 Collect a dataset of houses with features (e.g., size, bedrooms) and prices.

2 05:00 Define the hypothesis: h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n, with x_0 = 1.

3 16:28 Define the cost function: J(theta) = 1/2 * sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2.

4 18:27 Initialize theta (e.g., to zeros) and choose a learning rate alpha.

5 33:11 Repeat until convergence: for every j = 0, ..., n simultaneously, update theta_j := theta_j - alpha * sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)}) * x_j^{(i)} (batch gradient descent).

6 44:55 Alternatively, use stochastic gradient descent: repeatedly, for each i from 1 to m, update every theta_j (j = 0, ..., n) using that single example: theta_j := theta_j - alpha * (h_theta(x^{(i)}) - y^{(i)}) * x_j^{(i)}.

7 54:42 For a direct solution, use the normal equation: theta = (X^T X)^{-1} X^T y, where X is the design matrix.

Study Flashcards (11)

What is a regression problem?

easy

A supervised learning problem where the output variable is a continuous value.