MSE Loss Function

📋 Overview

Mean Squared Error (MSE) is the fundamental loss function for regression problems. This page covers the mathematical foundation, properties, and practical applications of MSE loss.

🎯 Learning Objectives

  • Understand when to use MSE vs. other loss functions
  • Implement MSE from scratch in different frameworks
  • Visualize MSE behavior and compare with other losses
  • Apply MSE in real regression problems

⏱️ Estimated Time: 15–20 minutes reading + 30 minutes practice

Mathematical Foundation

MSE measures the average squared differences between predicted and actual values:

MSE Formula:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

  • $y_i$ is the true value for sample $i$
  • $\hat{y}_i$ is the predicted value for sample $i$
  • $n$ is the number of samples
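
For a concrete starting point, here is a minimal from-scratch implementation with NumPy (a sketch; the sample values are arbitrary and reappear in the worked examples below):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Three samples with small prediction errors
print(mse([10.0, 5.0, 8.0], [9.0, 6.0, 7.5]))  # 0.75
```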

Theoretical Foundation of MSE

1) Maximum Likelihood Estimation (MLE) Perspective

MSE naturally arises from Maximum Likelihood Estimation under the assumption of Gaussian noise. Let's derive it step by step:

Problem Setup

Assume we have a regression model $f(\boldsymbol{x}; \boldsymbol{\theta})$ that predicts continuous values. The true relationship is:

$$y^{(i)} = f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) + \epsilon^{(i)}$$

Where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with mean 0 and variance $\sigma^2$.

Likelihood Function

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} p(y^{(i)} | \boldsymbol{x}^{(i)}, \boldsymbol{\theta})$$
$$p(y^{(i)} | \boldsymbol{x}^{(i)}, \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2}{2\sigma^2}\right)$$

Log-Likelihood

$$\log L(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Negative Log-Likelihood

$$-\log L(\boldsymbol{\theta}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Connection to MSE

Since $\sigma^2$ is constant and does not depend on $\boldsymbol{\theta}$, maximizing likelihood is equivalent to minimizing the sum of squared errors. Dividing by $n$ gives the Mean Squared Error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Key insight: under Gaussian noise, MSE equals the negative log-likelihood up to an additive constant and a positive scale factor, so minimizing MSE is equivalent to maximum likelihood estimation. This explains why MSE is the “natural” loss function for regression.
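
To make the equivalence concrete, the sketch below (assuming a fixed noise variance $\sigma^2 = 1$ and arbitrary example data) shows that the Gaussian negative log-likelihood is an affine function of the MSE, so both are minimized by the same parameters:

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma2=1.0):
    """Negative log-likelihood of y under independent N(y_hat, sigma2)."""
    resid = np.asarray(y) - np.asarray(y_hat)
    n = resid.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

def mse(y, y_hat):
    resid = np.asarray(y) - np.asarray(y_hat)
    return np.mean(resid ** 2)

y     = np.array([10.0, 5.0, 8.0])
y_hat = np.array([ 9.0, 6.0, 7.5])
n, sigma2 = y.size, 1.0

# NLL = constant + (n / (2 * sigma2)) * MSE
nll_from_mse = 0.5 * n * np.log(2 * np.pi * sigma2) + n * mse(y, y_hat) / (2 * sigma2)
print(np.isclose(gaussian_nll(y, y_hat, sigma2), nll_from_mse))  # True
```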

2) Information Theory Perspective

MSE can also be understood from an information-theoretic perspective. Suppose the true distribution of an observation is Gaussian $\mathcal{N}(y, \sigma^2)$, while the model’s predicted distribution is $\mathcal{N}(\hat{y}, \sigma^2)$, both with the same variance $\sigma^2$.

KL Divergence Definition

The Kullback–Leibler (KL) divergence between two distributions $p$ and $q$ is:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$

KL Divergence Between Gaussians

Let $p(x) = \mathcal{N}(\mu_1,\sigma^2)$ and $q(x) = \mathcal{N}(\mu_2,\sigma^2)$. Then:

$$\frac{p(x)}{q(x)} = \exp\!\left(\frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2}\right).$$

Taking the log and expectation under $p(x)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{p(x)}\!\left[\frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2}\right].$$

Since $x \sim \mathcal{N}(\mu_1, \sigma^2)$:

  • $\mathbb{E}[(x-\mu_1)^2] = \sigma^2$
  • $\mathbb{E}[(x-\mu_2)^2] = \sigma^2 + (\mu_1-\mu_2)^2$

Substituting back:

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$

Substitution for Regression

Setting $\mu_1 = y$ (true value) and $\mu_2 = \hat{y}$ (prediction), we get:

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(y, \sigma^2) \,\|\, \mathcal{N}(\hat{y}, \sigma^2)\right) = \frac{(y - \hat{y})^2}{2\sigma^2}.$$
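
As a quick numerical check (a sketch with arbitrary values for $y$, $\hat{y}$, and $\sigma$), a Monte Carlo estimate of $D_{\mathrm{KL}}$, obtained by sampling from $p$ and averaging $\log p(x) - \log q(x)$, matches the closed form $(y-\hat{y})^2/(2\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
y, y_hat, sigma = 3.0, 2.0, 1.5   # true mean, predicted mean, shared standard deviation

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]
x = rng.normal(y, sigma, size=1_000_000)
kl_mc = np.mean(log_normal_pdf(x, y, sigma) - log_normal_pdf(x, y_hat, sigma))

kl_closed = (y - y_hat) ** 2 / (2 * sigma**2)
print(kl_mc, kl_closed)  # both ≈ 0.222
```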

Cross-Entropy Relation

Cross-entropy between two distributions is related to KL divergence:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q).$$

Here $H(p)$ is the entropy of the true distribution and does not depend on model parameters. Therefore, minimizing the cross-entropy is equivalent to minimizing the KL divergence.

Connection to MSE

Since $D_{\mathrm{KL}}$ between the true Gaussian and predicted Gaussian reduces to the squared error scaled by $1/(2\sigma^2)$, minimizing KL divergence is the same as minimizing:

$$\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2.$$

Normalizing by $n$ gives the familiar Mean Squared Error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2.$$

Key insight: MSE is proportional to the KL divergence between the true data-generating Gaussian and the model’s predicted Gaussian. Thus, MSE can be viewed not only as the maximum likelihood loss under Gaussian noise but also as the cross-entropy (up to constants) between two Gaussian distributions.

3) Bias–Variance Decomposition

For any estimator $\hat{y}$ of $y$, MSE decomposes into bias and variance:

$$\mathrm{MSE} = \mathbb{E}[(y-\hat{y})^2] = \mathrm{Var}(y-\hat{y}) + [\mathbb{E}(y-\hat{y})]^2.$$

This shows that MSE captures both the variance of the error (random fluctuations around the mean error) and the squared bias (systematic error).
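
A small simulation (a sketch with synthetic data) confirms the decomposition: for a deliberately biased, noisy predictor, the mean squared error equals the error variance plus the squared mean error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictions are biased by +0.5 and carry extra noise with standard deviation 0.3
y_true = rng.normal(0.0, 1.0, size=100_000)
y_pred = y_true + 0.5 + rng.normal(0.0, 0.3, size=y_true.size)

err = y_true - y_pred
mse      = np.mean(err ** 2)         # E[(y - y_hat)^2]
variance = np.var(err)               # Var(y - y_hat)
bias_sq  = np.mean(err) ** 2         # [E(y - y_hat)]^2

print(np.isclose(mse, variance + bias_sq))  # True (≈ 0.09 + 0.25)
```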

Key Takeaways

  • Statistical view: MSE is the NLL under Gaussian noise.
  • Information-theoretic view: MSE is proportional to KL divergence between two Gaussians with the same variance.
  • Bias–variance tradeoff: MSE balances variance and bias in predictions.

Key Properties

  • Non-negative: MSE is always ≥ 0 and equals 0 only when every prediction matches its target exactly
  • Quadratic penalty: Large errors are penalized more heavily than small errors
  • Differentiable: Smooth gradient for optimization algorithms
  • Scale-dependent: MSE units are the square of the target variable units
  • MLE-derived: Natural loss function under Gaussian noise assumption
  • Bias-variance decomposition: Balances prediction bias and variance

Gradient Analysis

The gradient of MSE with respect to predictions is:

$$\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i) = \frac{2}{n} \cdot \text{error}_i$$

Key insights:

  • Gradient is proportional to the prediction error
  • Larger errors produce stronger gradients (faster learning)
  • Gradient direction pushes predictions toward true values
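
The analytic gradient can be checked against central finite differences (a minimal sketch using NumPy):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    """Analytic gradient of MSE with respect to the predictions."""
    return 2.0 / y.size * (y_hat - y)

y     = np.array([10.0, 5.0, 8.0])
y_hat = np.array([ 9.0, 6.0, 7.5])

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
num_grad = np.zeros_like(y_hat)
for i in range(y_hat.size):
    plus, minus = y_hat.copy(), y_hat.copy()
    plus[i] += eps
    minus[i] -= eps
    num_grad[i] = (mse(y, plus) - mse(y, minus)) / (2 * eps)

print(np.allclose(mse_grad(y, y_hat), num_grad))  # True
```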

MSE vs. Other Loss Functions

📊 MSE vs. MAE

MSE (Mean Squared Error):

  • Penalizes large errors heavily
  • Sensitive to outliers
  • Smooth gradients
  • A good fit when errors are approximately Gaussian

MAE (Mean Absolute Error):

  • Linear penalty for all errors
  • Robust to outliers
  • Non-smooth at zero error
  • More appropriate when errors are skewed or heavy-tailed
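
The difference in outlier sensitivity is easy to see numerically (a sketch; the data are arbitrary, with one target corrupted in the second run):

```python
import numpy as np

def mse(y, y_hat): return np.mean((y - y_hat) ** 2)
def mae(y, y_hat): return np.mean(np.abs(y - y_hat))

y_true = np.array([10.0, 5.0, 8.0, 12.0])
y_pred = np.array([ 9.5, 5.5, 8.2, 11.8])

print(mse(y_true, y_pred), mae(y_true, y_pred))   # 0.145 and 0.35: both small

# Replace one target with an outlier
y_out = y_true.copy()
y_out[0] = 40.0
print(mse(y_out, y_pred), mae(y_out, y_pred))     # ≈ 232.6 vs 7.85: MSE explodes
```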

🎯 MSE vs. CrossEntropy

MSE:

  • For regression (continuous values)
  • Outputs can be any real number
  • Quadratic penalty
  • Predictions are point estimates, not probabilities

CrossEntropy:

  • For classification (discrete classes)
  • Outputs are class probabilities in [0, 1]
  • Logarithmic penalty
  • Probability interpretation
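
In practice both losses are provided by deep-learning frameworks; the sketch below (assuming PyTorch is installed, with made-up tensors) contrasts the inputs each one expects:

```python
import torch
import torch.nn as nn

# Regression: predictions and targets are continuous values of the same shape
mse_loss = nn.MSELoss()
pred_reg   = torch.tensor([2.5, 0.0, 2.1])
target_reg = torch.tensor([3.0, -0.5, 2.0])
print(mse_loss(pred_reg, target_reg))      # mean of the squared differences

# Classification: raw logits of shape (N, C) and integer class labels of shape (N,)
ce_loss = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
labels = torch.tensor([0, 1])
print(ce_loss(logits, labels))             # average negative log-probability of the true classes
```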

Applications in Regression

  • Linear Regression: Classic use case for MSE
  • Neural Network Regression: Predicting house prices, stock prices
  • Time Series Forecasting: Weather prediction, sales forecasting
  • Computer Vision: Image denoising, super-resolution
  • Signal Processing: Audio reconstruction, sensor calibration

Detailed Calculation Examples

Let's work through several examples to understand MSE behavior:

Example 1: Perfect Predictions

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 10.0                   | 0.0                   | 0.0           |
| 2      | 5.0              | 5.0                    | 0.0                   | 0.0           |
| 3      | 8.0              | 8.0                    | 0.0                   | 0.0           |
| MSE    | –                | –                      | –                     | 0.0           |

Example 2: Some Errors

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 9.0                    | 1.0                   | 1.0           |
| 2      | 5.0              | 6.0                    | -1.0                  | 1.0           |
| 3      | 8.0              | 7.5                    | 0.5                   | 0.25          |
| MSE    | –                | –                      | –                     | 0.75          |

Calculation for Example 2:

$$\text{MSE} = \frac{1}{3} \times (1.0 + 1.0 + 0.25) = \frac{2.25}{3} = 0.75$$

Example 3: Large Errors (Outliers)

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 8.0                    | 2.0                   | 4.0           |
| 2      | 5.0              | 5.0                    | 0.0                   | 0.0           |
| 3      | 8.0              | 15.0                   | -7.0                  | 49.0          |
| MSE    | –                | –                      | –                     | 17.67         |

Key observation: The large error (-7.0) in sample 3 contributes 49.0 to the MSE, showing how MSE is sensitive to outliers.
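
The worked examples can be reproduced in a couple of lines (a sketch that repeats the `mse` helper so it runs standalone):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

# Example 2: moderate errors
print(mse([10.0, 5.0, 8.0], [9.0, 6.0, 7.5]))    # 0.75

# Example 3: one large error dominates the average
print(mse([10.0, 5.0, 8.0], [8.0, 5.0, 15.0]))   # ≈ 17.67
```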