MSE Loss Function

📋 Overview

Mean Squared Error (MSE) is the fundamental loss function for regression problems. This page covers the mathematical foundation, properties, and practical applications of MSE loss.

🎯 Learning Objectives

  • Understand when to use MSE vs. other loss functions
  • Implement MSE from scratch in different frameworks
  • Visualize MSE behavior and compare with other losses
  • Apply MSE in real regression problems

⏱️ Estimated Time: 15–20 minutes reading + 30 minutes practice

Mathematical Foundation

MSE measures the average squared differences between predicted and actual values:

MSE Formula:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

  • $y_i$ is the true value for sample $i$
  • $\hat{y}_i$ is the predicted value for sample $i$
  • $n$ is the number of samples
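
For a concrete starting point, here is a minimal from-scratch implementation with NumPy (a sketch; the sample values are arbitrary and reappear in the worked examples below):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Three samples with small prediction errors
print(mse([10.0, 5.0, 8.0], [9.0, 6.0, 7.5]))  # 0.75
```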

Theoretical Foundation of MSE

1) Maximum Likelihood Estimation (MLE) Perspective

MSE naturally arises from Maximum Likelihood Estimation under the assumption of Gaussian noise. Let's derive it step by step:

Problem Setup

Assume we have a regression model $f(\boldsymbol{x}; \boldsymbol{\theta})$ that predicts continuous values. The true relationship is:

$$y^{(i)} = f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) + \epsilon^{(i)}$$

Where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with mean 0 and variance $\sigma^2$.

Likelihood Function

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} p(y^{(i)} | \boldsymbol{x}^{(i)}, \boldsymbol{\theta})$$
$$p(y^{(i)} | \boldsymbol{x}^{(i)}, \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2}{2\sigma^2}\right)$$

Log-Likelihood

$$\log L(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Negative Log-Likelihood

$$-\log L(\boldsymbol{\theta}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Connection to MSE

Since $\sigma^2$ is constant and does not depend on $\boldsymbol{\theta}$, maximizing likelihood is equivalent to minimizing the sum of squared errors. Dividing by $n$ gives the Mean Squared Error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}))^2$$

Key insight: under Gaussian noise, MSE equals the negative log-likelihood up to an additive constant and a positive scale factor, so minimizing MSE is equivalent to maximum likelihood estimation. This explains why MSE is the “natural” loss function for regression.
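
To make the equivalence concrete, the sketch below (assuming a fixed noise variance $\sigma^2 = 1$ and arbitrary example data) shows that the Gaussian negative log-likelihood is an affine function of the MSE, so both are minimized by the same parameters:

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma2=1.0):
    """Negative log-likelihood of y under independent N(y_hat, sigma2)."""
    resid = np.asarray(y) - np.asarray(y_hat)
    n = resid.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

def mse(y, y_hat):
    resid = np.asarray(y) - np.asarray(y_hat)
    return np.mean(resid ** 2)

y     = np.array([10.0, 5.0, 8.0])
y_hat = np.array([ 9.0, 6.0, 7.5])
n, sigma2 = y.size, 1.0

# NLL = constant + (n / (2 * sigma2)) * MSE
nll_from_mse = 0.5 * n * np.log(2 * np.pi * sigma2) + n * mse(y, y_hat) / (2 * sigma2)
print(np.isclose(gaussian_nll(y, y_hat, sigma2), nll_from_mse))  # True
```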

2) Information Theory Perspective

MSE can also be understood from an information-theoretic perspective. Suppose the true distribution of an observation is Gaussian $\mathcal{N}(y, \sigma^2)$, while the model’s predicted distribution is $\mathcal{N}(\hat{y}, \sigma^2)$, both with the same variance $\sigma^2$.

KL Divergence Definition

The Kullback–Leibler (KL) divergence between two distributions $p$ and $q$ is:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$

KL Divergence Between Gaussians

Let $p(x) = \mathcal{N}(\mu_1,\sigma^2)$ and $q(x) = \mathcal{N}(\mu_2,\sigma^2)$. Then:

$$\frac{p(x)}{q(x)} = \exp\!\left(\frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2}\right).$$

Taking the log and expectation under $p(x)$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{p(x)}\!\left[\frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2}\right].$$

Since $x \sim \mathcal{N}(\mu_1, \sigma^2)$:

  • $\mathbb{E}[(x-\mu_1)^2] = \sigma^2$
  • $\mathbb{E}[(x-\mu_2)^2] = \sigma^2 + (\mu_1-\mu_2)^2$

Substituting back:

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$

Substitution for Regression

Setting $\mu_1 = y$ (true value) and $\mu_2 = \hat{y}$ (prediction), we get:

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(y, \sigma^2) \,\|\, \mathcal{N}(\hat{y}, \sigma^2)\right) = \frac{(y - \hat{y})^2}{2\sigma^2}.$$
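
As a quick numerical check (a sketch with arbitrary values for $y$, $\hat{y}$, and $\sigma$), a Monte Carlo estimate of $D_{\mathrm{KL}}$, obtained by sampling from $p$ and averaging $\log p(x) - \log q(x)$, matches the closed form $(y-\hat{y})^2/(2\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
y, y_hat, sigma = 3.0, 2.0, 1.5   # true mean, predicted mean, shared standard deviation

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]
x = rng.normal(y, sigma, size=1_000_000)
kl_mc = np.mean(log_normal_pdf(x, y, sigma) - log_normal_pdf(x, y_hat, sigma))

kl_closed = (y - y_hat) ** 2 / (2 * sigma**2)
print(kl_mc, kl_closed)  # both ≈ 0.222
```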

Cross-Entropy Relation

Cross-entropy between two distributions is related to KL divergence:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q).$$

Here $H(p)$ is the entropy of the true distribution and does not depend on model parameters. Therefore, minimizing the cross-entropy is equivalent to minimizing the KL divergence.

Connection to MSE

Since $D_{\mathrm{KL}}$ between the true Gaussian and predicted Gaussian reduces to the squared error scaled by $1/(2\sigma^2)$, minimizing KL divergence is the same as minimizing:

$$\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2.$$

Normalizing by $n$ gives the familiar Mean Squared Error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2.$$

Key insight: MSE is proportional to the KL divergence between the true data-generating Gaussian and the model’s predicted Gaussian. Thus, MSE can be viewed not only as the maximum likelihood loss under Gaussian noise but also as the cross-entropy (up to constants) between two Gaussian distributions.

3) Bias–Variance Decomposition

For any estimator $\hat{y}$ of $y$, MSE decomposes into bias and variance:

$$\mathrm{MSE} = \mathbb{E}[(y-\hat{y})^2] = \mathrm{Var}(y-\hat{y}) + [\mathbb{E}(y-\hat{y})]^2.$$

This shows that MSE captures both the variance of the error (random fluctuations around the mean error) and the squared bias (systematic error).
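
A small simulation (a sketch with synthetic data) confirms the decomposition: for a deliberately biased, noisy predictor, the mean squared error equals the error variance plus the squared mean error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictions are biased by +0.5 and carry extra noise with standard deviation 0.3
y_true = rng.normal(0.0, 1.0, size=100_000)
y_pred = y_true + 0.5 + rng.normal(0.0, 0.3, size=y_true.size)

err = y_true - y_pred
mse      = np.mean(err ** 2)         # E[(y - y_hat)^2]
variance = np.var(err)               # Var(y - y_hat)
bias_sq  = np.mean(err) ** 2         # [E(y - y_hat)]^2

print(np.isclose(mse, variance + bias_sq))  # True (≈ 0.09 + 0.25)
```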

Key Takeaways

  • Statistical view: MSE is the NLL under Gaussian noise.
  • Information-theoretic view: MSE is proportional to KL divergence between two Gaussians with the same variance.
  • Bias–variance tradeoff: MSE balances variance and bias in predictions.

Key Properties

  • Non-negative: MSE is always ≥ 0 and equals 0 only when every prediction matches its target exactly
  • Quadratic penalty: Large errors are penalized more heavily than small errors
  • Differentiable: Smooth gradient for optimization algorithms
  • Scale-dependent: MSE units are the square of the target variable units
  • MLE-derived: Natural loss function under Gaussian noise assumption
  • Bias-variance decomposition: Balances prediction bias and variance

Gradient Analysis

The gradient of MSE with respect to predictions is:

$$\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i) = \frac{2}{n} \cdot \text{error}_i$$

Key insights:

  • Gradient is proportional to the prediction error
  • Larger errors produce stronger gradients (faster learning)
  • Gradient direction pushes predictions toward true values
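
The analytic gradient can be checked against central finite differences (a minimal sketch using NumPy):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    """Analytic gradient of MSE with respect to the predictions."""
    return 2.0 / y.size * (y_hat - y)

y     = np.array([10.0, 5.0, 8.0])
y_hat = np.array([ 9.0, 6.0, 7.5])

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
num_grad = np.zeros_like(y_hat)
for i in range(y_hat.size):
    plus, minus = y_hat.copy(), y_hat.copy()
    plus[i] += eps
    minus[i] -= eps
    num_grad[i] = (mse(y, plus) - mse(y, minus)) / (2 * eps)

print(np.allclose(mse_grad(y, y_hat), num_grad))  # True
```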

MSE vs. Other Loss Functions

📊 MSE vs. MAE

MSE (Mean Squared Error):

  • Penalizes large errors heavily
  • Sensitive to outliers
  • Smooth gradients
  • A good fit when errors are approximately Gaussian

MAE (Mean Absolute Error):

  • Linear penalty for all errors
  • Robust to outliers
  • Non-smooth at zero error
  • More appropriate when errors are skewed or heavy-tailed
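
The difference in outlier sensitivity is easy to see numerically (a sketch; the data are arbitrary, with one target corrupted in the second run):

```python
import numpy as np

def mse(y, y_hat): return np.mean((y - y_hat) ** 2)
def mae(y, y_hat): return np.mean(np.abs(y - y_hat))

y_true = np.array([10.0, 5.0, 8.0, 12.0])
y_pred = np.array([ 9.5, 5.5, 8.2, 11.8])

print(mse(y_true, y_pred), mae(y_true, y_pred))   # 0.145 and 0.35: both small

# Replace one target with an outlier
y_out = y_true.copy()
y_out[0] = 40.0
print(mse(y_out, y_pred), mae(y_out, y_pred))     # ≈ 232.6 vs 7.85: MSE explodes
```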

🎯 MSE vs. CrossEntropy

MSE:

  • For regression (continuous values)
  • Outputs can be any real number
  • Quadratic penalty
  • Predictions are point estimates, not probabilities

CrossEntropy:

  • For classification (discrete classes)
  • Outputs are class probabilities in [0, 1]
  • Logarithmic penalty
  • Probability interpretation
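
In practice both losses are provided by deep-learning frameworks; the sketch below (assuming PyTorch is installed, with made-up tensors) contrasts the inputs each one expects:

```python
import torch
import torch.nn as nn

# Regression: predictions and targets are continuous values of the same shape
mse_loss = nn.MSELoss()
pred_reg   = torch.tensor([2.5, 0.0, 2.1])
target_reg = torch.tensor([3.0, -0.5, 2.0])
print(mse_loss(pred_reg, target_reg))      # mean of the squared differences

# Classification: raw logits of shape (N, C) and integer class labels of shape (N,)
ce_loss = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
labels = torch.tensor([0, 1])
print(ce_loss(logits, labels))             # average negative log-probability of the true classes
```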

Applications in Regression

  • Linear Regression: Classic use case for MSE
  • Neural Network Regression: Predicting house prices, stock prices
  • Time Series Forecasting: Weather prediction, sales forecasting
  • Computer Vision: Image denoising, super-resolution
  • Signal Processing: Audio reconstruction, sensor calibration

Detailed Calculation Examples

Let's work through several examples to understand MSE behavior:

Example 1: Perfect Predictions

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 10.0                   | 0.0                   | 0.0           |
| 2      | 5.0              | 5.0                    | 0.0                   | 0.0           |
| 3      | 8.0              | 8.0                    | 0.0                   | 0.0           |
| MSE    | –                | –                      | –                     | 0.0           |

Example 2: Some Errors

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 9.0                    | 1.0                   | 1.0           |
| 2      | 5.0              | 6.0                    | -1.0                  | 1.0           |
| 3      | 8.0              | 7.5                    | 0.5                   | 0.25          |
| MSE    | –                | –                      | –                     | 0.75          |

Calculation for Example 2:

$$\text{MSE} = \frac{1}{3} \times (1.0 + 1.0 + 0.25) = \frac{2.25}{3} = 0.75$$

Example 3: Large Errors (Outliers)

| Sample | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|--------|------------------|------------------------|-----------------------|---------------|
| 1      | 10.0             | 8.0                    | 2.0                   | 4.0           |
| 2      | 5.0              | 5.0                    | 0.0                   | 0.0           |
| 3      | 8.0              | 15.0                   | -7.0                  | 49.0          |
| MSE    | –                | –                      | –                     | 17.67         |

Key observation: The large error (-7.0) in sample 3 contributes 49.0 to the MSE, showing how MSE is sensitive to outliers.
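
The worked examples can be reproduced in a couple of lines (a sketch that repeats the `mse` helper so it runs standalone):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

# Example 2: moderate errors
print(mse([10.0, 5.0, 8.0], [9.0, 6.0, 7.5]))    # 0.75

# Example 3: one large error dominates the average
print(mse([10.0, 5.0, 8.0], [8.0, 5.0, 15.0]))   # ≈ 17.67
```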