MSE Loss Function
📋 Overview
Mean Squared Error (MSE) is the fundamental loss function for regression problems. This page covers the mathematical foundation, properties, and practical applications of MSE loss.
🎯 Learning Objectives
- Understand when to use MSE vs. other loss functions
- Implement MSE from scratch in different frameworks
- Visualize MSE behavior and compare with other losses
- Apply MSE in real regression problems
Mathematical Foundation
MSE measures the average squared differences between predicted and actual values:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where:
- $y_i$ is the true value for sample $i$
- $\hat{y}_i$ is the predicted value for sample $i$
- $n$ is the number of samples
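As a concrete reference, here is a minimal from-scratch sketch of this formula in NumPy (the helper name `mse` is illustrative; frameworks ship their own implementations such as `torch.nn.MSELoss`):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Values taken from Example 2 further below: three samples with small errors.
print(mse([10.0, 5.0, 8.0], [9.0, 6.0, 7.5]))   # 0.75
```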
Theoretical Foundation of MSE
1) Maximum Likelihood Estimation (MLE) Perspective
MSE naturally arises from Maximum Likelihood Estimation under the assumption of Gaussian noise. Let's derive it step by step:
Problem Setup
Assume we have a regression model $f(\boldsymbol{x}; \boldsymbol{\theta})$ that predicts continuous values. The true relationship is:
$$y^{(i)} = f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) + \epsilon^{(i)}$$
where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with mean $0$ and variance $\sigma^2$.
Likelihood Function
Under this assumption, each observation $y^{(i)}$ is Gaussian with mean $f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})$, so the likelihood of the data is:
$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})\right)^2}{2\sigma^2}\right)$$
Log-Likelihood
$$\log L(\boldsymbol{\theta}) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})\right)^2$$
Negative Log-Likelihood
$$-\log L(\boldsymbol{\theta}) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})\right)^2$$
Connection to MSE
Since $\sigma^2$ is constant and does not depend on $\boldsymbol{\theta}$, maximizing the likelihood is equivalent to minimizing the sum of squared errors. Dividing by $n$ gives the Mean Squared Error:
$$\text{MSE}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta})\right)^2$$
Key insight: MSE is the negative log-likelihood (up to constants) under Gaussian noise. This explains why MSE is the “natural” loss function for regression.
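As a small sanity check of this equivalence, the sketch below grid-searches a hypothetical one-parameter linear model and confirms that the parameter minimizing the Gaussian negative log-likelihood is the same one that minimizes MSE (all data and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D model y = w * x + Gaussian noise, used only to illustrate
# that the parameter minimizing the Gaussian NLL also minimizes the MSE.
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.5 * x + rng.normal(0.0, 0.5, size=200)   # "true" w = 1.5, sigma = 0.5

def mse(w):
    return np.mean((y - w * x) ** 2)

def gaussian_nll(w, sigma=0.5):
    resid = y - w * x
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

candidates = np.linspace(0.5, 2.5, 201)
print("w minimizing MSE:", candidates[np.argmin([mse(w) for w in candidates])])
print("w minimizing NLL:", candidates[np.argmin([gaussian_nll(w) for w in candidates])])
# The two grid-search minimizers coincide, as the derivation above predicts.
```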
2) Information Theory Perspective
MSE can also be understood from an information-theoretic perspective. Suppose the true distribution of an observation is Gaussian $\mathcal{N}(y, \sigma^2)$, while the model’s predicted distribution is $\mathcal{N}(\hat{y}, \sigma^2)$, both with the same variance $\sigma^2$.
KL Divergence Definition
The Kullback–Leibler (KL) divergence between two distributions $p$ and $q$ is:
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
KL Divergence Between Gaussians
Let $p(x) = \mathcal{N}(\mu_1,\sigma^2)$ and $q(x) = \mathcal{N}(\mu_2,\sigma^2)$. Then:
$$\frac{p(x)}{q(x)} = \exp\!\left(-\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_2)^2}{2\sigma^2}\right)$$
Taking the log and expectation under $p(x)$:
$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2}\right]$$
Since $x \sim \mathcal{N}(\mu_1, \sigma^2)$:
- $\mathbb{E}[(x-\mu_1)^2] = \sigma^2$
- $\mathbb{E}[(x-\mu_2)^2] = \sigma^2 + (\mu_1-\mu_2)^2$
Substituting back:
$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{\left(\sigma^2 + (\mu_1-\mu_2)^2\right) - \sigma^2}{2\sigma^2} = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}$$
Substitution for Regression
Setting $\mu_1 = y$ (true value) and $\mu_2 = \hat{y}$ (prediction), we get:
$$D_{\mathrm{KL}}\!\left(\mathcal{N}(y,\sigma^2) \,\big\|\, \mathcal{N}(\hat{y},\sigma^2)\right) = \frac{(y-\hat{y})^2}{2\sigma^2}$$
Cross-Entropy Relation
Cross-entropy between two distributions is related to KL divergence:
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$
Here $H(p)$ is the entropy of the true distribution and does not depend on model parameters. Therefore, minimizing the cross-entropy is equivalent to minimizing the KL divergence.
Connection to MSE
Since $D_{\mathrm{KL}}$ between the true Gaussian and predicted Gaussian reduces to the squared error scaled by $1/(2\sigma^2)$, minimizing KL divergence over the dataset is the same as minimizing:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Normalizing by $n$ gives the familiar Mean Squared Error:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Key insight: MSE is proportional to the KL divergence between the true data-generating Gaussian and the model’s predicted Gaussian. Thus, MSE can be viewed not only as the maximum likelihood loss under Gaussian noise but also as the cross-entropy (up to constants) between two Gaussian distributions.
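The closed form above can be checked numerically. This sketch (assuming SciPy is available; the means and variance are arbitrary illustrative values) integrates the KL divergence between two equal-variance Gaussians and compares it with $(\mu_1-\mu_2)^2/(2\sigma^2)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerically verify D_KL(N(mu1, s^2) || N(mu2, s^2)) = (mu1 - mu2)^2 / (2 s^2).
mu1, mu2, sigma = 3.0, 1.0, 1.5   # arbitrary illustrative values

p, q = norm(mu1, sigma), norm(mu2, sigma)
integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))

kl_numeric, _ = quad(integrand, mu1 - 10 * sigma, mu1 + 10 * sigma)
kl_closed = (mu1 - mu2) ** 2 / (2 * sigma ** 2)
print(kl_numeric, kl_closed)   # both ~0.8889
```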
3) Bias–Variance Decomposition
For any estimator $\hat{y}$ of $y$, MSE decomposes into squared bias and variance:
$$\mathbb{E}\!\left[(\hat{y} - y)^2\right] = \underbrace{\left(\mathbb{E}[\hat{y}] - y\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\!\left[\left(\hat{y} - \mathbb{E}[\hat{y}]\right)^2\right]}_{\text{Variance}}$$
This shows MSE captures both variance (fluctuations of predictions) and squared bias (systematic error).
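A quick Monte Carlo sketch can confirm the decomposition. The estimator below (a deliberately shrunken sample mean) and all numbers are hypothetical, chosen only so that both bias and variance are non-zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check of MSE = Bias^2 + Variance for a deliberately biased
# estimator: a shrunken sample mean of noisy measurements of y_true.
y_true = 5.0
n_trials, n_samples, shrink = 100_000, 20, 0.8

samples = rng.normal(y_true, 2.0, size=(n_trials, n_samples))
estimates = shrink * samples.mean(axis=1)

mse = np.mean((estimates - y_true) ** 2)
bias_sq = (estimates.mean() - y_true) ** 2
variance = estimates.var()
print(f"MSE          : {mse:.4f}")
print(f"Bias^2 + Var : {bias_sq + variance:.4f}")   # agrees up to Monte Carlo noise
```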
Key Takeaways
- Statistical view: MSE is the NLL under Gaussian noise.
- Information-theoretic view: MSE is proportional to KL divergence between two Gaussians with the same variance.
- Bias–variance tradeoff: MSE balances variance and bias in predictions.
Key Properties
- Non-negative: MSE is always ≥ 0, equals 0 only when predictions are perfect
- Quadratic penalty: Large errors are penalized more heavily than small errors
- Differentiable: Smooth gradient for optimization algorithms
- Scale-dependent: MSE units are the square of the target variable units
- MLE-derived: Natural loss function under Gaussian noise assumption
- Bias-variance decomposition: Balances prediction bias and variance
Gradient Analysis
The gradient of MSE with respect to predictions is:
$$\frac{\partial\, \text{MSE}}{\partial \hat{y}_i} = -\frac{2}{n}\left(y_i - \hat{y}_i\right) = \frac{2}{n}\left(\hat{y}_i - y_i\right)$$
Key insights:
- Gradient is proportional to the prediction error
- Larger errors produce stronger gradients (faster learning)
- Gradient direction pushes predictions toward true values
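A quick way to validate this expression is to compare the analytic gradient against central finite differences; the sketch below does so for the values used in Example 2 further down (NumPy only; the helper names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_grad(y_true, y_pred):
    # Analytic gradient: dMSE/dy_pred_i = (2 / n) * (y_pred_i - y_true_i)
    return 2.0 * (y_pred - y_true) / len(y_true)

y_true = np.array([10.0, 5.0, 8.0])
y_pred = np.array([9.0, 6.0, 7.5])

# Central finite-difference approximation of the gradient.
eps = 1e-6
numeric = np.array([
    (mse(y_true, y_pred + eps * np.eye(3)[i]) - mse(y_true, y_pred - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(mse_grad(y_true, y_pred))   # [-0.6667  0.6667 -0.3333]
print(numeric)                    # matches the analytic gradient
```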
MSE vs. Other Loss Functions
📊 MSE vs. MAE
MSE (Mean Squared Error):
- Penalizes large errors heavily
- Sensitive to outliers
- Smooth gradients
- Good for normal distributions
MAE (Mean Absolute Error):
- Linear penalty for all errors
- Robust to outliers
- Non-smooth at zero error
- Good for skewed distributions
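The difference in outlier sensitivity described above can be made concrete with a short sketch (the data values are made up for illustration):

```python
import numpy as np

def mse(y, y_hat): return np.mean((y - y_hat) ** 2)
def mae(y, y_hat): return np.mean(np.abs(y - y_hat))

y_true = np.array([10.0, 5.0, 8.0, 12.0, 7.0])
y_pred = np.array([9.5, 5.5, 8.2, 11.5, 7.3])
print("clean   -> MSE:", mse(y_true, y_pred), "MAE:", mae(y_true, y_pred))

# Corrupt one prediction with a large error: MSE grows quadratically, MAE linearly.
y_out = y_pred.copy()
y_out[0] = 30.0
print("outlier -> MSE:", mse(y_true, y_out), "MAE:", mae(y_true, y_out))
```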
🎯 MSE vs. CrossEntropy
MSE:
- For regression (continuous values)
- Outputs can be any real number
- Quadratic penalty
- No probability interpretation
CrossEntropy:
- For classification (discrete classes)
- Outputs are probabilities [0,1]
- Logarithmic penalty
- Probability interpretation
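A minimal PyTorch sketch, assuming `torch` is installed, contrasting the inputs the two losses expect (the tensor values are arbitrary):

```python
import torch
import torch.nn as nn

# Regression: MSELoss compares real-valued predictions with real-valued targets.
preds = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(preds, targets))          # mean of squared differences

# Classification: CrossEntropyLoss takes raw logits (one row per sample)
# and integer class indices, applying log-softmax internally.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(logits, labels))
```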
Applications in Regression
- Linear Regression: Classic use case for MSE
- Neural Network Regression: Predicting house prices, stock prices
- Time Series Forecasting: Weather prediction, sales forecasting
- Computer Vision: Image denoising, super-resolution
- Signal Processing: Audio reconstruction, sensor calibration
Interactive Visualization
Explore how MSE behaves with different prediction errors:
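A minimal matplotlib sketch of the same idea, plotting the per-sample squared and absolute penalties as a function of the prediction error:

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-sample penalty as a function of the prediction error e = y - y_hat.
errors = np.linspace(-3, 3, 301)
plt.plot(errors, errors ** 2, label="Squared error (MSE term)")
plt.plot(errors, np.abs(errors), label="Absolute error (MAE term)")
plt.xlabel(r"Prediction error $y - \hat{y}$")
plt.ylabel("Penalty")
plt.title("Quadratic vs. linear penalty")
plt.legend()
plt.show()
```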
Detailed Calculation Examples
Let's work through several examples to understand MSE behavior:
Example 1: Perfect Predictions
| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|---|---|---|---|---|
| 1 | 10.0 | 10.0 | 0.0 | 0.0 |
| 2 | 5.0 | 5.0 | 0.0 | 0.0 |
| 3 | 8.0 | 8.0 | 0.0 | 0.0 |
| MSE | - | - | - | 0.0 |
Example 2: Some Errors
| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|---|---|---|---|---|
| 1 | 10.0 | 9.0 | 1.0 | 1.0 |
| 2 | 5.0 | 6.0 | -1.0 | 1.0 |
| 3 | 8.0 | 7.5 | 0.5 | 0.25 |
| MSE | - | - | - | 0.75 |
$$\text{MSE} = \frac{1}{3} \times (1.0 + 1.0 + 0.25) = \frac{2.25}{3} = 0.75$$
Example 3: Large Errors (Outliers)
| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|---|---|---|---|---|
| 1 | 10.0 | 8.0 | 2.0 | 4.0 |
| 2 | 5.0 | 5.0 | 0.0 | 0.0 |
| 3 | 8.0 | 15.0 | -7.0 | 49.0 |
| MSE | - | - | - | 17.67 |
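$$\text{MSE} = \frac{1}{3} \times (4.0 + 0.0 + 49.0) = \frac{53.0}{3} \approx 17.67$$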
Key observation: The large error (-7.0) in sample 3 contributes 49.0 to the MSE, showing how MSE is sensitive to outliers.