Linear Regression

📋 Overview

Linear Regression is one of the foundational techniques in machine learning and statistical modeling. It models the relationship between a dependent variable and one or more independent variables with a linear function. Despite its simplicity, it is powerful and widely applicable.

🎯 Learning Objectives

  • Understand the mathematical foundation of linear regression
  • Implement linear regression from scratch and using libraries
  • Visualize regression results and interpret coefficients
  • Apply linear regression to real-world problems
  • Understand assumptions and limitations

⏱️ Estimated Time: 20–25 minutes reading + 45 minutes practice

Mathematical Foundation

Linear Regression models the relationship between a dependent variable $y$ and one or more independent variables $x_1, x_2, \dots, x_n$ using a linear function of the model parameters.

🔍 Key Insight: "Linear" vs "Non-linear" Relationships

"Linear" refers to the relationship being linear in the model parameters $\boldsymbol{w}$, not necessarily in the input features $\boldsymbol{x}$.

Through feature engineering, we can transform input features to capture non-linear relationships while keeping the model linear in parameters.

Simple Linear Regression

Model:

$$y = w_0 + w_1 x + \varepsilon$$

Where:

  • $y$ is the dependent variable (target)
  • $x$ is the independent variable (feature)
  • $w_0$ is the intercept (bias term)
  • $w_1$ is the slope (coefficient)
  • $\varepsilon$ is the error term (residuals)
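
As a quick illustration, here is a minimal NumPy sketch that fits $w_0$ and $w_1$ by ordinary least squares; the data are synthetic and the variable names are only for illustration:

```python
import numpy as np

# synthetic data for illustration: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)

# degree-1 polynomial fit returns [slope, intercept]
w1, w0 = np.polyfit(x, y, deg=1)
print(f"estimated intercept w0 = {w0:.2f}, slope w1 = {w1:.2f}")
```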

Multiple Linear Regression

Model:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + \varepsilon$$

Matrix Form

For multiple features, we can express linear regression in matrix form:

Vectorized Form:

$$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{w} + \boldsymbol{\varepsilon}$$

Where:

  • $\boldsymbol{y}$ is the target vector $(n \times 1)$
  • $\boldsymbol{X}$ is the feature (design) matrix $(n \times p)$, whose first column is all ones so that the intercept $w_0$ is absorbed into $\boldsymbol{w}$
  • $\boldsymbol{w}$ is the coefficient vector $(p \times 1)$
  • $\boldsymbol{\varepsilon}$ is the error vector $(n \times 1)$
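
In code, this amounts to stacking a column of ones in front of the raw feature matrix. A minimal NumPy sketch with made-up values:

```python
import numpy as np

# illustrative feature matrix with 2 raw features and n = 3 observations
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5]])
y = np.array([4.0, 3.0, 6.0])

# prepend a column of ones so the intercept w_0 becomes the first entry of w
X = np.column_stack([np.ones(len(X_raw)), X_raw])
print(X.shape, y.shape)  # (3, 3) (3,)
```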

Theoretical Foundation: Maximum Likelihood Estimation

Before diving into the cost function, let's understand why we use the sum of squared residuals. This comes from Maximum Likelihood Estimation (MLE).

MLE Derivation

We assume that the errors $\varepsilon_i$ are independently and identically distributed (i.i.d.) following a normal distribution with mean 0 and variance $\sigma^2$:

$$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$

This means:

$$p(y_i \mid \boldsymbol{x}_i, \boldsymbol{w}, \sigma^2) = \mathcal{N}(y_i \mid \boldsymbol{x}_i^T \boldsymbol{w}, \sigma^2)$$

The likelihood function for all $n$ observations is:

$$L(\boldsymbol{w}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2}{2\sigma^2}\right)$$

Taking the negative log-likelihood (to minimize instead of maximize):

$$-\log L(\boldsymbol{w}, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2$$

Since we only care about finding $\boldsymbol{w}$, we can drop the first term (it does not depend on $\boldsymbol{w}$) and the positive constant factor $\frac{1}{2\sigma^2}$, since neither changes the minimizer. What remains is the sum of squared residuals, which leads directly to the MSE cost function:

Objective Function

The goal is to find the parameters $\boldsymbol{w}$ that minimize the sum of squared residuals:

Cost Function (MSE):

$$J(\boldsymbol{w}) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2$$

Matrix Form:

$$J(\boldsymbol{w}) = \frac{1}{2n} ||\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}||^2$$

The extra factor of $\frac{1}{2}$ is included purely for convenience: it cancels when differentiating and does not change the location of the minimum.
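
A direct translation of the matrix form into NumPy might look like the sketch below, assuming $\boldsymbol{X}$ already contains the intercept column:

```python
import numpy as np

def mse_cost(w, X, y):
    """J(w) = 1/(2n) * ||y - Xw||^2, matching the matrix form above."""
    n = len(y)
    residuals = y - X @ w
    return residuals @ residuals / (2 * n)
```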

Normal Equation (Closed-form Solution)

Taking the derivative and setting it to zero gives us the normal equation:

Normal Equation:

$$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

Derivation:

$$\frac{\partial J}{\partial \boldsymbol{w}} = \frac{1}{n}\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) = 0$$
$$\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y}$$
$$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
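
A sketch of the normal equation in NumPy. Note that $\boldsymbol{X}^T\boldsymbol{X}$ must be invertible (i.e., $\boldsymbol{X}$ must have full column rank); in practice, a least-squares solver or pseudoinverse is preferred over an explicit matrix inverse for numerical stability:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve the normal equation X^T X w = X^T y without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# A more robust alternative that also handles rank-deficient X:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```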

Gradient Descent (Iterative Solution)

For large datasets, we can use gradient descent:

Gradient:

$$\nabla J(\boldsymbol{w}) = \frac{1}{n}\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$$

Update Rule:

$$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \alpha \nabla J(\boldsymbol{w}^{(t)})$$

Where $\alpha$ is the learning rate.
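
A minimal batch gradient descent loop implementing the gradient and update rule above; the default learning rate and iteration count are illustrative and assume reasonably scaled features:

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for J(w) = 1/(2n) * ||y - Xw||^2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n  # gradient derived above
        w = w - alpha * grad          # update rule with learning rate alpha
    return w
```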

Key Assumptions

📊 Linear Relationship

The relationship between independent and dependent variables should be linear.

🎯 Independence

Observations should be independent of each other (no autocorrelation).

📈 Homoscedasticity

Residuals should have constant variance (homoscedasticity).

🔔 Normality

Residuals should be normally distributed.

🚫 No Multicollinearity

Independent variables should not be highly correlated with each other.

✅ No Outliers

The model should not be unduly influenced by extreme values.
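
Some of these assumptions can be checked numerically from the residuals. The sketch below assumes a design matrix X (with intercept column) and fitted weights w, and uses SciPy and statsmodels helpers; the interpretations in the comments are common rules of thumb, not hard cutoffs:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_assumptions(X, y, w):
    """Rough numerical checks; X includes the intercept column, w is a fitted weight vector."""
    residuals = y - X @ w

    # Normality: Shapiro-Wilk test (large p-value -> no evidence against normality)
    _, shapiro_p = stats.shapiro(residuals)

    # Independence: Durbin-Watson statistic near 2 suggests little autocorrelation
    dw = durbin_watson(residuals)

    # Multicollinearity: variance inflation factor per feature (skip the intercept column)
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]

    return {"shapiro_p": shapiro_p, "durbin_watson": dw, "vif": vifs}
```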

Feature Engineering: Capturing Non-linear Relationships

🎯 Key Understanding: "Linear" in Linear Regression

  • "Linear" refers to the model being degree 1 in the parameters $\boldsymbol{w}$.
  • Linear Regression can still model non-linear relationships between $x$ and $y$ through feature engineering.

Although Linear Regression is "linear" in its parameters, it can capture complex non-linear relationships by transforming the input features:

Polynomial Features

Transform input features to polynomial terms:

Quadratic:

$$y = w_0 + w_1 x + w_2 x^2 + \varepsilon$$

Cubic:

$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \varepsilon$$
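
scikit-learn's PolynomialFeatures can generate these terms automatically. A sketch that fits a cubic model to synthetic data (the data-generating function is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data with a cubic trend (values are illustrative)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 3 + rng.normal(0, 1.0, size=100)

# expand x into [x, x^2, x^3]; the model remains linear in its parameters
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_)
```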

Interaction Terms

Include products of features to capture interactions:

$$y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + \varepsilon$$

Basis Functions

Use various basis functions for flexible modeling:

$$y = w_0 + w_1 \sin(x) + w_2 \cos(x) + w_3 \log(x) + \varepsilon$$

Example: House Price with Non-linear Features

$$\text{Price} = w_0 + w_1 \text{Size} + w_2 \text{Size}^2 + w_3 \text{Bedrooms} + w_4 \text{Size} \times \text{Bedrooms} + \varepsilon$$

This model can capture:

  • Size²: curvature in the size effect (e.g., diminishing returns if $w_2 < 0$)
  • Size × Bedrooms: an interaction between size and bedroom count
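
In code, such engineered columns can be built explicitly; a small NumPy sketch with illustrative size and bedroom values:

```python
import numpy as np

# illustrative feature values
size = np.array([2100.0, 1600.0, 2400.0, 1416.0, 3000.0])
bedrooms = np.array([3.0, 3.0, 3.0, 2.0, 4.0])

# engineered design matrix: [1, Size, Size^2, Bedrooms, Size * Bedrooms];
# the model is still linear in w even though Size enters non-linearly
X = np.column_stack([np.ones_like(size), size, size ** 2, bedrooms, size * bedrooms])
print(X.shape)  # (5, 5)
```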

Applications

  • Economics: Demand forecasting, price prediction, GDP modeling
  • Healthcare: Medical diagnosis, treatment outcome prediction, drug effectiveness
  • Business: Sales forecasting, risk assessment, customer lifetime value
  • Engineering: Quality control, system modeling, performance optimization
  • Social Sciences: Policy analysis, behavioral studies, demographic modeling

Model Evaluation

R-squared (Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
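
All four metrics are available in scikit-learn (RMSE is simply the square root of MSE). A sketch with illustrative arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# illustrative true and predicted values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.0, 10.4])

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
print(r2, mse, rmse, mae)
```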

💻 Code Examples

NumPy, scikit-learn, and PyTorch implementations

📊 Advanced Topics

Ridge, Lasso, and Elastic Net regression

🏋️ Exercises

Hands-on practice problems

Detailed Example: House Price Prediction

Let's work through a practical example of predicting house prices based on size and number of bedrooms.

Sample Data

House   Size (sq ft)   Bedrooms   Price ($)
1       2100           3          399,900
2       1600           3          329,900
3       2400           3          369,000
4       1416           2          232,000
5       3000           4          539,900

Model Setup

We want to predict price based on size and bedrooms:

$$\text{Price} = w_0 + w_1 \times \text{Size} + w_2 \times \text{Bedrooms} + \varepsilon$$

Matrix Formulation

$$\boldsymbol{X} = \begin{bmatrix} 1 & 2100 & 3 \\ 1 & 1600 & 3 \\ 1 & 2400 & 3 \\ 1 & 1416 & 2 \\ 1 & 3000 & 4 \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} 399900 \\ 329900 \\ 369000 \\ 232000 \\ 539900 \end{bmatrix}$$

Solution

Using the normal equation:

$$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
$$\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 89597.05 \\ 139.21 \\ -8738.02 \end{bmatrix}$$

Final Model

$$\text{Price} = 89597.05 + 139.21 \times \text{Size} - 8738.02 \times \text{Bedrooms}$$

Interpretation

  • Intercept $w_0$ ($89,597): the predicted price when size and bedrooms are both zero; this is an extrapolation far outside the data, so it has no direct physical meaning
  • Size coefficient $w_1$ ($139.21): each additional square foot adds about $139.21 to the predicted price, holding bedrooms fixed
  • Bedrooms coefficient $w_2$ (−$8,738): each additional bedroom decreases the predicted price by about $8,738 with size held fixed; this counterintuitive sign usually reflects multicollinearity between bedrooms and size
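
For completeness, here is a sketch of reproducing the worked example with NumPy; depending on rounding and on the exact rows used for the fit, the printed coefficients may differ somewhat from the rounded values quoted above:

```python
import numpy as np

# design matrix and targets from the sample table above (intercept column of ones first)
X = np.array([[1, 2100, 3],
              [1, 1600, 3],
              [1, 2400, 3],
              [1, 1416, 2],
              [1, 3000, 4]], dtype=float)
y = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# numerically stable least-squares solve of the normal equation
w, *_ = np.linalg.lstsq(X, y, rcond=None)
w0, w1, w2 = w
print(f"Price = {w0:.2f} + {w1:.2f} * Size + {w2:.2f} * Bedrooms")
```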