
CrossEntropy Loss Function

📋 Overview

CrossEntropy is a fundamental loss function for classification. This page covers the mathematical foundation (MLE & information theory), gradients, and worked examples.

🎯 Learning Objectives

  • Understand why CrossEntropy arises from MLE
  • Use CrossEntropy consistently with hard and soft labels
  • Recall the key gradient result $\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$
  • Apply CrossEntropy to real classification problems
⏱️ Estimated Time: 15–20 minutes reading + 30 minutes practice

Mathematical Foundation

CrossEntropy measures the discrepancy between the target distribution $y$ and the predicted distribution $\hat{y}$.

Binary CrossEntropy (Bernoulli):

$$\mathcal{L}(y,\hat{y}) = -\big[ y\,\log\hat{y} + (1-y)\,\log(1-\hat{y}) \big],\quad y\in[0,1]$$

Multi-class CrossEntropy (Categorical):

$$\mathcal{L}(y, \hat{y}) = - \sum_{i=1}^{C} y_i\,\log \hat{y}_i$$

where $y$ is the target distribution (hard or soft) and $\hat{y}$ is the predicted distribution (e.g., softmax output).
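
As a quick reference, here is a minimal NumPy sketch of both formulas (the function names and the `eps` clipping constant are illustrative, not taken from any particular library):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary CE for a target y in [0, 1] and a prediction y_hat in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class CE for a target distribution y and predicted distribution y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y * np.log(y_hat))

print(binary_cross_entropy(1.0, 0.8))                        # -log(0.8) ≈ 0.223
print(categorical_cross_entropy(np.array([1, 0, 0]),
                                np.array([0.8, 0.1, 0.1])))  # same value ≈ 0.223
```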

Theoretical Foundations

1. Maximum Likelihood Estimation (MLE) Perspective

Assume we have $N$ samples. The target for sample $n$ is a distribution $y^{(n)} = [y^{(n)}_1,\ldots,y^{(n)}_C]$ (hard one-hot or soft). The model predicts $\hat{y}^{(n)} = [\hat{y}^{(n)}_1,\ldots,\hat{y}^{(n)}_C]$ with $\sum_i \hat{y}^{(n)}_i = 1$.

Likelihood:

$$L(\theta) = \prod_{n=1}^N \prod_{i=1}^C \big(\hat{y}^{(n)}_i\big)^{\,y^{(n)}_i}$$

Taking logs:

$$\log L(\theta) = \sum_{n=1}^N \sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i.$$

Minimizing the negative log-likelihood (NLL) gives exactly the CrossEntropy:

$$\mathcal{L}_{\text{NLL}}(\theta) = -\log L(\theta) = -\sum_{n=1}^N \sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i.$$

Key insight: CrossEntropy is the negative log-likelihood whether $y$ is one-hot (hard) or a soft distribution.
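
To make the equivalence concrete, the sketch below uses an arbitrary two-sample batch (values chosen purely for illustration) to check numerically that the negative log of the product likelihood matches the summed CrossEntropy:

```python
import numpy as np

# Hypothetical 2-sample batch with 3 classes.
y     = np.array([[1.0, 0.0, 0.0],      # hard (one-hot) label
                  [0.7, 0.2, 0.1]])     # soft label
y_hat = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1]])

# Likelihood L(theta) = prod_n prod_i y_hat[n, i] ** y[n, i]
likelihood = np.prod(y_hat ** y)
nll = -np.log(likelihood)

# CrossEntropy summed over the batch
ce = -np.sum(y * np.log(y_hat))

print(nll, ce)   # identical up to floating-point error (≈ 1.052)
```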

2. Information Theory Perspective

In information theory, CrossEntropy $H(y,\hat{y})$ represents the expected code length for data sampled from the true distribution $y$ when using a code optimized for the predicted distribution $\hat{y}$.

$$H(y,\hat{y}) = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

CrossEntropy is related to the entropy of the true distribution and the KL divergence as:

$$H(y,\hat{y}) = H(y) + D_{\mathrm{KL}}(y \,\|\, \hat{y}).$$

Here:

$$H(y) = -\sum_{i=1}^{C} y_i \log y_i \quad \text{(Entropy)}$$
$$D_{\mathrm{KL}}(y \,\|\, \hat{y}) = \sum_{i=1}^{C} y_i \log \frac{y_i}{\hat{y}_i} = \sum_{i=1}^{C} y_i \log y_i - \sum_{i=1}^{C} y_i \log \hat{y}_i = -H(y) + H(y,\hat{y}).$$
Interpretation:
  • $H(y,\hat{y})$ is the cross-entropy between the true and predicted distributions.
  • $H(y)$ measures the inherent uncertainty of the true distribution.
  • $D_{\mathrm{KL}}(y \,\|\, \hat{y}) \geq 0$ and equals zero iff $y = \hat{y}$.
  • $D_{\mathrm{KL}}$ quantifies the inefficiency (extra code length) incurred when coding with $\hat{y}$ instead of $y$.
  • Since $H(y)$ is constant w.r.t. model parameters, minimizing CrossEntropy is exactly equivalent to minimizing KL divergence (the sketch below checks the decomposition numerically).
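
A minimal NumPy check of the decomposition, using an arbitrary pair of distributions (the numbers are purely illustrative):

```python
import numpy as np

y     = np.array([0.7, 0.2, 0.1])   # illustrative "true" distribution
y_hat = np.array([0.6, 0.3, 0.1])   # illustrative predicted distribution

cross_entropy = -np.sum(y * np.log(y_hat))
entropy       = -np.sum(y * np.log(y))
kl_divergence = np.sum(y * np.log(y / y_hat))

print(cross_entropy)            # ≈ 0.829
print(entropy + kl_divergence)  # same value: H(y) + D_KL(y || y_hat)
```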

3. Gradient Analysis

Let $\hat{y} = \mathrm{softmax}(z)$. The gradient of CrossEntropy w.r.t. the logits $z$ is:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \hat{y}_i - y_i.$$

This holds for both hard and soft labels.
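
A quick way to convince yourself of this result is a finite-difference check; the logits and the soft label below are arbitrary illustrative values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 0.3])           # illustrative logits
y = np.array([0.7, 0.2, 0.1])            # works for soft labels too

analytic = softmax(z) - y                # the closed-form gradient y_hat - y

# Central finite-difference estimate of d loss / d z_i
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```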

Understanding Hard vs. Soft Labels

Hard Labels (one-hot)

For 3 classes:

Class 1: $y = [1, 0, 0]$    Class 2: $y = [0, 1, 0]$    Class 3: $y = [0, 0, 1]$

Soft Labels (probability distributions)

Examples:

$y = [0.7, 0.2, 0.1]$,   $y = [0.4, 0.3, 0.3]$,   $y = [0.95, 0.03, 0.02]$
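
The same CE formula covers both cases; with a one-hot target, only the true-class term survives the sum. A minimal sketch with illustrative numbers:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """One formula handles both hard (one-hot) and soft targets."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

y_hat = np.array([0.8, 0.1, 0.1])

hard = np.array([1.0, 0.0, 0.0])          # one-hot target
soft = np.array([0.7, 0.2, 0.1])          # soft target

print(cross_entropy(hard, y_hat))          # ≈ 0.223 (only the true-class term contributes)
print(cross_entropy(soft, y_hat))          # ≈ 0.847 (all three terms contribute)
```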

Applications in Classification

  • Image: object recognition, handwritten digits
  • Text: sentiment analysis, spam detection
  • Medical: diagnosis from images/symptoms
  • Recommenders: preference/category prediction
  • NLP: NER, POS tagging

Why CrossEntropy?

  • Probabilistic: directly compares distributions $y$ vs. $\hat{y}$
  • Useful gradients: strong updates when predictions are wrong
  • Convex for logistic regression
  • Information-theoretic meaning

💻 Code Examples

NumPy, PyTorch, TensorFlow implementations
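
As a starting point for this section, here is a hedged PyTorch sketch. `torch.nn.functional.cross_entropy` takes raw logits and applies log-softmax internally; passing a probability-distribution (soft) target assumes a reasonably recent PyTorch release:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1]])      # raw scores, shape (batch, classes)

# Hard labels: class indices
hard_target = torch.tensor([0])
print(F.cross_entropy(logits, hard_target))   # ≈ 0.317

# Soft labels: probability distributions over classes
soft_target = torch.tensor([[0.7, 0.2, 0.1]])
print(F.cross_entropy(logits, soft_target))   # ≈ 0.807
```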

📊 Visualizations

Loss curves and gradient behavior

🏋️ Exercises

Hands-on practice problems

Detailed Calculation Examples

Worked examples to see how CrossEntropy behaves.

Hard Label Examples

| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|--------|------------------|------------------------|--------------|----------------|
| 1 | $[1, 0, 0]$ | $[0.8, 0.1, 0.1]$ | $0.223$ | Good prediction |
| 2 | $[0, 1, 0]$ | $[0.3, 0.4, 0.3]$ | $0.916$ | Uncertain prediction |
| 3 | $[0, 0, 1]$ | $[0.9, 0.05, 0.05]$ | $2.996$ | Wrong prediction |
| 4 | $[1, 0, 0]$ | $[0.99, 0.005, 0.005]$ | $0.010$ | Excellent prediction |
| 5 | $[0, 1, 0]$ | $[0.33, 0.33, 0.34]$ | $1.109$ | Random prediction |

Soft Label Examples

| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|--------|------------------|------------------------|--------------|----------------|
| 6 | $[0.7, 0.2, 0.1]$ | $[0.6, 0.3, 0.1]$ | $0.829$ | Close match |
| 7 | $[0.4, 0.3, 0.3]$ | $[0.8, 0.1, 0.1]$ | $1.471$ | Overconfident |
| 8 | $[0.95, 0.03, 0.02]$ | $[0.93, 0.04, 0.03]$ | $0.236$ | Excellent match (close to $H(y) \approx 0.232$) |

Calculation for Sample 1 (hard):

$$\mathcal{L}(y,\hat{y}) = -[1\cdot\log(0.8) + 0\cdot\log(0.1) + 0\cdot\log(0.1)] = -\log(0.8) = 0.223.$$
Calculation for Sample 6 (soft):

$$\mathcal{L}(y,\hat{y}) = -[0.7\log(0.6) + 0.2\log(0.3) + 0.1\log(0.1)] = -[0.7(-0.511) + 0.2(-1.204) + 0.1(-2.303)] = 0.829.$$
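
These two calculations can be reproduced directly with a few lines of NumPy (natural logarithms, matching the tables above):

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(np.array(y) * np.log(np.array(y_hat)))

samples = {
    1: ([1, 0, 0],       [0.8, 0.1, 0.1]),
    6: ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]),
}

for k, (y, y_hat) in samples.items():
    print(f"Sample {k}: CE = {cross_entropy(y, y_hat):.3f}")
# Sample 1: CE = 0.223
# Sample 6: CE = 0.829
```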

Key Observations

  • Perfect prediction: If $y=\hat{y}$, CE equals the entropy $H(y)$
  • Confident & wrong: CE grows quickly when the model is confidently wrong (see the sweep below)
  • Uncertain: Near-uniform predictions give moderate loss
  • Soft labels: Naturally handled by the same CE formula
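
The second observation is easy to see by sweeping the probability the model assigns to the true class (illustrative values, hard label for class 0):

```python
import numpy as np

# Hard label for class 0; vary the predicted probability on the true class.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    y_hat = np.array([p, (1 - p) / 2, (1 - p) / 2])
    ce = -np.log(y_hat[0])                 # only the true-class term survives
    print(f"p(true class) = {p:5.2f}  ->  CE = {ce:.3f}")
# CE ≈ 0.010, 0.105, 0.693, 2.303, 4.605 — the loss blows up as the model
# becomes confident in the wrong classes.
```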