Linear Regression
📋 Overview
Linear Regression is one of the foundational techniques of machine learning and statistical modeling. It models the relationship between a dependent variable and one or more independent variables with a function that is linear in its parameters. Despite its simplicity, it is powerful and widely applicable.
🎯 Learning Objectives
- Understand the mathematical foundation of linear regression
- Implement linear regression from scratch and using libraries
- Visualize regression results and interpret coefficients
- Apply linear regression to real-world problems
- Understand assumptions and limitations
Mathematical Foundation
Linear Regression models the relationship between a dependent variable $y$ and one or more independent variables $x_1, x_2, ..., x_n$ using a linear function of the model parameters.
🔍 Key Insight: "Linear" vs "Non-linear" Relationships
"Linear" refers to the relationship being linear in the model parameters $\boldsymbol{w}$, not necessarily in the input features $\boldsymbol{x}$.
Through feature engineering, we can transform input features to capture non-linear relationships while keeping the model linear in parameters.
Simple Linear Regression
$$y = w_0 + w_1 x + \varepsilon$$
Where:
- $y$ is the dependent variable (target)
- $x$ is the independent variable (feature)
- $w_0$ is the intercept (bias term)
- $w_1$ is the slope (coefficient)
- $\varepsilon$ is the error term (residuals)
Multiple Linear Regression
$$y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n + \varepsilon$$
Matrix Form
For multiple features, we can express linear regression in matrix form:
$$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{w} + \boldsymbol{\varepsilon}$$
Where:
- $\boldsymbol{y}$ is the target vector $(n \times 1)$
- $\boldsymbol{X}$ is the feature matrix $(n \times p)$
- $\boldsymbol{w}$ is the coefficient vector $(p \times 1)$
- $\boldsymbol{\varepsilon}$ is the error vector $(n \times 1)$
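To make the matrix form concrete, here is a minimal NumPy sketch (the data values are arbitrary and purely illustrative) showing how a leading column of ones is added to $\boldsymbol{X}$ so that the intercept $w_0$ is absorbed into $\boldsymbol{w}$:

```python
import numpy as np

# Toy data: n = 4 observations, 2 raw features (values are arbitrary)
X_raw = np.array([[2.0, 3.0],
                  [1.5, 2.0],
                  [3.0, 4.0],
                  [2.5, 3.5]])
y = np.array([10.0, 7.0, 14.0, 12.0])

# Prepend a column of ones so the intercept w_0 is absorbed into w
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

print(X.shape, y.shape)  # (4, 3) (4,)
```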
Theoretical Foundation: Maximum Likelihood Estimation
Before diving into the cost function, let's understand why we use the sum of squared residuals. This comes from Maximum Likelihood Estimation (MLE).
MLE Derivation
We assume that the errors $\varepsilon_i$ are independently and identically distributed (i.i.d.) following a normal distribution with mean 0 and variance $\sigma^2$:
$$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$
This means:
$$y_i \mid \boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{x}_i^T \boldsymbol{w}, \sigma^2)$$
The likelihood function for all $n$ observations is:
$$L(\boldsymbol{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2}{2\sigma^2}\right)$$
Taking the negative log-likelihood (to minimize instead of maximize):
$$-\log L(\boldsymbol{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2$$
Since we only care about finding $\boldsymbol{w}$, we can ignore the first term and the constant $\frac{1}{2\sigma^2}$. What remains is the sum of squared residuals, which (up to the $\frac{1}{2n}$ scaling) is exactly the MSE cost function defined below.
Objective Function
The goal is to find the parameters $\boldsymbol{w}$ that minimize the sum of squared residuals:
$$J(\boldsymbol{w}) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i^T \boldsymbol{w})^2$$
$$J(\boldsymbol{w}) = \frac{1}{2n} ||\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}||^2$$
Normal Equation (Closed-form Solution)
Taking the derivative and setting it to zero gives us the normal equation:
$$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
Derivation: setting the gradient of $J(\boldsymbol{w})$ to zero,
$$\nabla J(\boldsymbol{w}) = \frac{1}{n}\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) = \boldsymbol{0} \;\Rightarrow\; \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y} \;\Rightarrow\; \boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y},$$
assuming $\boldsymbol{X}^T\boldsymbol{X}$ is invertible.
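A minimal NumPy sketch of the closed-form solution (the synthetic data and variable names are assumptions for illustration); `np.linalg.solve` is applied to the normal equations rather than forming $(\boldsymbol{X}^T\boldsymbol{X})^{-1}$ explicitly, which is numerically preferable:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w.

    X is assumed to already contain a leading column of ones for the intercept.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny synthetic check: y = 2 + 3x with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)
X = np.column_stack([np.ones_like(x), x])

w = fit_normal_equation(X, y)
print(w)  # approximately [2, 3]
```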
Gradient Descent (Iterative Solution)
For large datasets, we can use gradient descent:
$$\nabla J(\boldsymbol{w}) = \frac{1}{n}\boldsymbol{X}^T(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$$
$$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \alpha \nabla J(\boldsymbol{w}^{(t)})$$
Where $\alpha$ is the learning rate.
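A sketch of batch gradient descent on the MSE cost above, again on small synthetic data; the learning rate and iteration count are illustrative and in practice depend on feature scaling:

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Batch gradient descent on J(w) = (1/2n) ||y - Xw||^2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n  # gradient of J(w)
        w -= lr * grad
    return w

# Synthetic data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)
X = np.column_stack([np.ones_like(x), x])

print(fit_gradient_descent(X, y))  # should approach [2, 3]
```

With unscaled features, a learning rate that is too large will diverge; standardizing the columns of $\boldsymbol{X}$ usually allows larger step sizes and faster convergence.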
Key Assumptions
📊 Linear Relationship
The relationship between independent and dependent variables should be linear.
🎯 Independence
Observations should be independent of each other (no autocorrelation).
📈 Homoscedasticity
Residuals should have constant variance (homoscedasticity).
🔔 Normality
Residuals should be normally distributed.
🚫 No Multicollinearity
Independent variables should not be highly correlated with each other.
✅ No Outliers
The model should not be unduly influenced by extreme values.
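The following rough sketch (synthetic data, illustrative checks) shows how a couple of these assumptions can be probed numerically once residuals are available, using SciPy's Shapiro-Wilk test for normality and a simple split-sample spread comparison as a crude homoscedasticity check:

```python
import numpy as np
from scipy import stats

# Fit a simple model on synthetic data to obtain residuals (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 0.5, 100)
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w
residuals = y - y_hat

# Normality of residuals: Shapiro-Wilk test (large p-value -> no evidence against normality)
stat, pvalue = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", pvalue)

# Rough homoscedasticity check: residual spread in low vs. high fitted-value halves
lo = residuals[y_hat < np.median(y_hat)]
hi = residuals[y_hat >= np.median(y_hat)]
print("Residual std (low / high fitted values):", lo.std(), hi.std())
```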
Feature Engineering: Capturing Non-linear Relationships
🎯 Key Understanding: "Linear" in Linear Regression
- "Linear" refers to the model being degree 1 in the parameters $\boldsymbol{w}$, not in the inputs
- Non-linear relationships between $x$ and $y$ can still be modeled through feature engineering
Although Linear Regression is "linear" in parameters, it can model complex non-linear relationships through feature engineering:
Polynomial Features
Transform input features to polynomial terms:
$$y = w_0 + w_1 x + w_2 x^2 + \varepsilon$$
$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \varepsilon$$
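As an illustrative scikit-learn sketch (the data are synthetic): `PolynomialFeatures` expands $x$ into polynomial terms, after which an ordinary `LinearRegression` fit remains linear in the parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic trend: y = 1 + 2x - 0.5x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.2, size=100)

# Degree-2 polynomial expansion followed by ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)

print(model.named_steps["linearregression"].coef_)       # roughly [2, -0.5]
print(model.named_steps["linearregression"].intercept_)  # roughly 1
```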
Interaction Terms
Include products of features to capture interactions:
$$y = w_0 + w_1 x_1 + w_2 x_2 + w_3 (x_1 x_2) + \varepsilon$$
Basis Functions
Use various basis functions $\phi_j(x)$ (e.g., polynomials, Gaussians, splines) for flexible modeling:
$$y = w_0 + \sum_{j=1}^{m} w_j \phi_j(x) + \varepsilon$$
Example: House Price with Non-linear Features
$$\text{Price} = w_0 + w_1 \cdot \text{Size} + w_2 \cdot \text{Size}^2 + w_3 \cdot \text{Bedrooms} + w_4 \cdot (\text{Size} \times \text{Bedrooms}) + \varepsilon$$
This model can capture:
- Size²: Diminishing returns (price increase slows as size grows)
- Size × Bedrooms: Interaction between size and bedroom count
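A short sketch of how such engineered columns might be built by hand (the numbers reuse the five sample houses from the table further below; with so few rows the fit itself is purely illustrative of the column construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Size, Bedrooms, Price for the five sample houses in the example below
size = np.array([2100, 1600, 2400, 1416, 3000], dtype=float)
bedrooms = np.array([3, 3, 3, 2, 4], dtype=float)
price = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# Engineered design matrix: [Size, Size^2, Bedrooms, Size * Bedrooms]
X = np.column_stack([size, size ** 2, bedrooms, size * bedrooms])

# The model is still linear in its parameters, so ordinary least squares applies.
# Note: with only five rows this is not a meaningful fit, only a demonstration.
model = LinearRegression().fit(X, price)
print(model.intercept_, model.coef_)
```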
Applications
- Economics: Demand forecasting, price prediction, GDP modeling
- Healthcare: Medical diagnosis, treatment outcome prediction, drug effectiveness
- Business: Sales forecasting, risk assessment, customer lifetime value
- Engineering: Quality control, system modeling, performance optimization
- Social Sciences: Policy analysis, behavioral studies, demographic modeling
Model Evaluation
R-squared (Coefficient of Determination)
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Mean Squared Error (MSE)
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE)
$$\text{RMSE} = \sqrt{\text{MSE}}$$
Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
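These metrics can be computed with scikit-learn as in the following sketch (the prediction values are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("R^2 :", r2_score(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
```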
💻 Code Examples
NumPy, scikit-learn, and PyTorch implementations
📊 Advanced Topics
Ridge, Lasso, and Elastic Net regression
🏋️ Exercises
Hands-on practice problems
Detailed Example: House Price Prediction
Let's work through a practical example of predicting house prices based on size and number of bedrooms.
Sample Data
| House | Size (sq ft) | Bedrooms | Price ($) |
|---|---|---|---|
| 1 | 2100 | 3 | 399,900 |
| 2 | 1600 | 3 | 329,900 |
| 3 | 2400 | 3 | 369,000 |
| 4 | 1416 | 2 | 232,000 |
| 5 | 3000 | 4 | 539,900 |
Model Setup
We want to predict price based on size and bedrooms:
$$\text{Price} = w_0 + w_1 \cdot \text{Size} + w_2 \cdot \text{Bedrooms} + \varepsilon$$
Matrix Formulation
$$\boldsymbol{X} = \begin{bmatrix} 1 & 2100 & 3 \\ 1 & 1600 & 3 \\ 1 & 2400 & 3 \\ 1 & 1416 & 2 \\ 1 & 3000 & 4 \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} 399{,}900 \\ 329{,}900 \\ 369{,}000 \\ 232{,}000 \\ 539{,}900 \end{bmatrix}$$
Solution
Using the normal equation:
$$\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
Final Model
$$\hat{\text{Price}} = 89{,}597 + 139.21 \cdot \text{Size} - 8{,}738 \cdot \text{Bedrooms}$$
Interpretation
- Intercept $w_0$ ($89,597): Base price when size and bedrooms are zero
- Size coefficient $w_1$ ($139.21): Each additional square foot adds $139.21 to the price
- Bedrooms coefficient $w_2$ (-$8,738): Each additional bedroom decreases price by $8,738 (counterintuitive, possibly due to correlation with other factors)
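As a sketch, the normal equation can be applied directly to the five sample rows above. Note that the coefficients quoted in this example may have been estimated on a larger dataset, so fitting only these five houses will not necessarily reproduce them exactly:

```python
import numpy as np

# The five sample houses from the table above, with leading 1s for the intercept
X = np.array([[1, 2100, 3],
              [1, 1600, 3],
              [1, 2400, 3],
              [1, 1416, 2],
              [1, 3000, 4]], dtype=float)
y = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# Least-squares solution of Xw ~ y (equivalent to the normal equation,
# but more numerically robust than forming the inverse explicitly)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0 (intercept), w1 (size), w2 (bedrooms)]
```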