CrossEntropy Loss Function
📋 Overview
CrossEntropy is a fundamental loss function for classification. This page covers the mathematical foundation (MLE & information theory), gradients, and worked examples.
🎯 Learning Objectives
- Understand why CrossEntropy arises from MLE
- Use CrossEntropy consistently with hard and soft labels
- Recall the key gradient result $\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$
- Apply CrossEntropy to real classification problems
Mathematical Foundation
CrossEntropy measures the discrepancy between the target distribution $y$ and the predicted distribution $\hat{y}$.
Binary form, where $y\in[0,1]$ (hard labels use $y\in\{0,1\}$):
$$\mathcal{L}(y,\hat{y}) = -\big[ y\,\log\hat{y} + (1-y)\,\log(1-\hat{y}) \big]$$
Multi-class form over $C$ classes:
$$\mathcal{L}(y, \hat{y}) = - \sum_{i=1}^{C} y_i\,\log \hat{y}_i$$
where $y$ is the target distribution (hard or soft) and $\hat{y}$ is the predicted distribution (e.g., softmax output).
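As a quick sanity check, here is a minimal NumPy sketch of both formulas; the helper names (`binary_cross_entropy`, `cross_entropy`) and the clipping constant are illustrative choices, not part of any particular library.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary CE: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class CE: -sum_i y_i * log(y_hat_i)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat), axis=-1)

print(binary_cross_entropy(1.0, 0.8))                                  # -log(0.8) ≈ 0.223
print(cross_entropy(np.array([1, 0, 0]), np.array([0.8, 0.1, 0.1])))   # ≈ 0.223
```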
Theoretical Foundations
1. Maximum Likelihood Estimation (MLE) Perspective
Assume we have $N$ samples. The target for sample $n$ is a distribution $y^{(n)} = [y^{(n)}_1,\ldots,y^{(n)}_C]$ (hard one-hot or soft). The model predicts $\hat{y}^{(n)} = [\hat{y}^{(n)}_1,\ldots,\hat{y}^{(n)}_C]$ with $\sum_i \hat{y}^{(n)}_i = 1$.
$$L(\theta) = \prod_{n=1}^N \prod_{i=1}^C \big(\hat{y}^{(n)}_i\big)^{\,y^{(n)}_i}$$
Taking logs:
$$\log L(\theta) = \sum_{n=1}^N \sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i$$
Minimizing the negative log-likelihood (NLL) gives exactly the CrossEntropy summed over samples:
$$-\log L(\theta) = \sum_{n=1}^N \Big( -\sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i \Big) = \sum_{n=1}^N \mathcal{L}\big(y^{(n)}, \hat{y}^{(n)}\big)$$
Key insight: CrossEntropy is the negative log-likelihood whether $y$ is one-hot (hard) or a soft distribution.
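This equivalence is easy to check numerically on a hypothetical two-sample batch (one hard target, one soft target); the values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy batch: 2 samples, 3 classes (one hard target, one soft target).
y     = np.array([[1.0, 0.0, 0.0],
                  [0.7, 0.2, 0.1]])
y_hat = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1]])

# Likelihood L(theta) = prod_n prod_i y_hat[n, i] ** y[n, i]
likelihood = np.prod(y_hat ** y)

# Negative log-likelihood vs. per-sample CrossEntropy summed over the batch
nll    = -np.log(likelihood)
ce_sum = np.sum(-np.sum(y * np.log(y_hat), axis=1))

print(nll, ce_sum)   # both ≈ 1.052 -> NLL equals the summed CrossEntropy
```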
2. Information Theory Perspective
In information theory, CrossEntropy $H(y,\hat{y})$ represents the expected code length for data sampled from the true distribution $y$ when using a code optimized for the predicted distribution $\hat{y}$.
CrossEntropy is related to the entropy of the true distribution and the KL divergence as:
$$H(y,\hat{y}) = H(y) + D_{\mathrm{KL}}(y \,\|\, \hat{y})$$
Here:
- $H(y,\hat{y})$ is the cross-entropy between the true and predicted distributions.
- $H(y)$ measures the inherent uncertainty of the true distribution.
- $D_{\mathrm{KL}}(y \,\|\, \hat{y}) \geq 0$ and equals zero iff $y = \hat{y}$. It quantifies the inefficiency (extra code length) incurred when using $\hat{y}$ instead of $y$.
- Since $H(y)$ is constant w.r.t. model parameters, minimizing CrossEntropy is exactly equivalent to minimizing KL divergence.
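The decomposition above can be verified numerically; the two distributions below are illustrative, and the script simply re-implements the three quantities from their definitions.

```python
import numpy as np

y     = np.array([0.7, 0.2, 0.1])   # true distribution (illustrative)
y_hat = np.array([0.6, 0.3, 0.1])   # predicted distribution (illustrative)

cross_entropy = -np.sum(y * np.log(y_hat))      # H(y, y_hat)
entropy       = -np.sum(y * np.log(y))          # H(y)
kl            =  np.sum(y * np.log(y / y_hat))  # D_KL(y || y_hat)

print(cross_entropy, entropy + kl)   # both ≈ 0.829
print(kl >= 0)                       # True: KL is the non-negative "extra" cost
```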
3. Gradient Analysis
Let $\hat{y} = \mathrm{softmax}(z)$. The gradient of CrossEntropy w.r.t. the logits $z$ is:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$
This holds for both hard and soft labels.
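A minimal sketch that checks this result against central finite differences, assuming nothing beyond NumPy; the logits and target below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 0.5, -1.0])        # arbitrary logits (illustrative)
y = np.array([0.0, 1.0, 0.0])         # works the same for a soft y

analytic = softmax(z) - y             # the claimed gradient: y_hat - y

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(analytic)
print(numeric)   # closely matches the analytic gradient
```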
Understanding Hard vs. Soft Labels
Hard Labels (one-hot)
- Class 1: $y = [1, 0, 0]$
- Class 2: $y = [0, 1, 0]$
- Class 3: $y = [0, 0, 1]$
Soft Labels (probability distributions)
- $y = [0.7, 0.2, 0.1]$
- $y = [0.4, 0.3, 0.3]$
- $y = [0.95, 0.03, 0.02]$
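The same CE formula handles both cases. The short sketch below (illustrative values, hypothetical `cross_entropy` helper) shows that a one-hot label reduces the sum to $-\log \hat{y}_{\text{true}}$, while a soft label weights every class.

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.8, 0.1, 0.1])

# Hard label: only the true class contributes -> CE = -log(y_hat[true_class])
hard = np.array([1.0, 0.0, 0.0])
print(cross_entropy(hard, y_hat), -np.log(y_hat[0]))   # both ≈ 0.223

# Soft label: every class contributes, weighted by its target probability
soft = np.array([0.7, 0.2, 0.1])
print(cross_entropy(soft, y_hat))                      # ≈ 0.847
```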
Applications in Classification
- Image: object recognition, handwritten digits
- Text: sentiment analysis, spam detection
- Medical: diagnosis from images/symptoms
- Recommenders: preference/category prediction
- NLP: NER, POS tagging
Why CrossEntropy?
- Probabilistic: directly compares distributions $y$ vs. $\hat{y}$
- Useful gradients: strong updates when predictions are wrong (see the sketch after this list)
- Convex for logistic regression
- Information-theoretic meaning
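A tiny sketch of the "useful gradients" point: with softmax outputs the logit gradient is $\hat{y} - y$, so a confidently wrong prediction produces a much larger update than a nearly correct one. The probability vectors below are made up for illustration.

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])                    # true class is class 1

confidently_wrong = np.array([0.05, 0.90, 0.05])
nearly_correct    = np.array([0.90, 0.05, 0.05])

# With softmax outputs, the gradient w.r.t. the logits is simply y_hat - y.
print(confidently_wrong - y)   # [-0.95, 0.90, 0.05] -> large corrective update
print(nearly_correct - y)      # [-0.10, 0.05, 0.05] -> small update
```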
💻 Code Examples
NumPy, PyTorch, TensorFlow implementations
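As a starting point, here is a minimal sketch comparing a hand-written NumPy version against PyTorch's `torch.nn.functional.cross_entropy`, which expects raw logits and class indices and applies log-softmax internally. The logits and labels are made-up examples; a TensorFlow variant would follow the same pattern.

```python
import numpy as np
import torch
import torch.nn.functional as F

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
labels = np.array([0, 1])                      # hard class indices

# NumPy: log-softmax + negative log-probability of the true class
def np_cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# PyTorch: F.cross_entropy combines log-softmax and NLL internally
torch_loss = F.cross_entropy(torch.tensor(logits), torch.tensor(labels))

print(np_cross_entropy(logits, labels))   # ≈ same value as below
print(torch_loss.item())
```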
📊 Visualizations
Loss curves and gradient behavior
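For example, a minimal matplotlib sketch (assuming matplotlib is available) of the loss $-\log(p)$ and its gradient magnitude $1/p$ as a function of the predicted probability of the true class:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 200)          # predicted probability of the true class

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(p, -np.log(p))                  # loss blows up as p -> 0
ax1.set_xlabel("predicted probability of true class")
ax1.set_ylabel("CE loss  -log(p)")

ax2.plot(p, 1.0 / p)                     # |dL/dp| = 1/p: huge gradients when wrong
ax2.set_xlabel("predicted probability of true class")
ax2.set_ylabel("|dL/dp| = 1/p")

plt.tight_layout()
plt.show()
```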
🏋️ Exercises
Hands-on practice problems
Detailed Calculation Examples
Worked examples to see how CrossEntropy behaves.
Hard Label Examples
| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|---|---|---|---|---|
| 1 | $[1, 0, 0]$ | $[0.8, 0.1, 0.1]$ | $0.223$ | Good prediction |
| 2 | $[0, 1, 0]$ | $[0.3, 0.4, 0.3]$ | $0.916$ | Uncertain prediction |
| 3 | $[0, 0, 1]$ | $[0.9, 0.05, 0.05]$ | $2.996$ | Wrong prediction |
| 4 | $[1, 0, 0]$ | $[0.99, 0.005, 0.005]$ | $0.010$ | Excellent prediction |
| 5 | $[0, 1, 0]$ | $[0.33, 0.33, 0.34]$ | $1.109$ | Random prediction |
Soft Label Examples
| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|---|---|---|---|---|
| 6 | $[0.7, 0.2, 0.1]$ | $[0.6, 0.3, 0.1]$ | $0.829$ | Close match |
| 7 | $[0.4, 0.3, 0.3]$ | $[0.8, 0.1, 0.1]$ | $1.471$ | Overconfident |
| 8 | $[0.95, 0.03, 0.02]$ | $[0.93, 0.04, 0.03]$ | $0.236$ | Excellent match |
Worked calculation for Sample 1 (hard label):
$$\mathcal{L}(y,\hat{y}) = -[1\cdot\log(0.8) + 0\cdot\log(0.1) + 0\cdot\log(0.1)] = -\log(0.8) = 0.223.$$
Worked calculation for Sample 6 (soft label):
$$\mathcal{L}(y,\hat{y}) = -[0.7\log(0.6) + 0.2\log(0.3) + 0.1\log(0.1)] = -[0.7(-0.511) + 0.2(-1.204) + 0.1(-2.303)] = 0.829.$$
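These rows can be reproduced with a few lines of NumPy; the `cross_entropy` helper below is a hypothetical re-implementation of the formula above, not a library function.

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(np.asarray(y) * np.log(np.asarray(y_hat)))

samples = {
    1: ([1, 0, 0],          [0.8, 0.1, 0.1]),
    5: ([0, 1, 0],          [0.33, 0.33, 0.34]),
    6: ([0.7, 0.2, 0.1],    [0.6, 0.3, 0.1]),
    7: ([0.4, 0.3, 0.3],    [0.8, 0.1, 0.1]),
    8: ([0.95, 0.03, 0.02], [0.93, 0.04, 0.03]),
}

for n, (y, y_hat) in samples.items():
    print(f"Sample {n}: CE = {cross_entropy(y, y_hat):.3f}")
# Sample 1: 0.223, 5: 1.109, 6: 0.829, 7: 1.471, 8: 0.236
```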
Key Observations
- Perfect prediction: If $y=\hat{y}$, CE equals the entropy $H(y)$ (see the sketch after this list)
- Confident & wrong: CE grows quickly when the model is confidently wrong
- Uncertain: Near-uniform predictions give moderate loss
- Soft labels: Naturally handled by the same CE formula
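A short check of the first observation, and of the fact that $H(y)$ is the floor of the loss for soft labels, using an illustrative target distribution:

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])
entropy = -np.sum(y * np.log(y))                    # H(y) ≈ 0.802

for y_hat in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]):
    ce = -np.sum(y * np.log(np.array(y_hat)))
    print(f"CE = {ce:.3f}  (H(y) = {entropy:.3f})")
# CE is minimized, and equal to H(y), exactly when y_hat == y
```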