CrossEntropy Loss Function
📋 Overview
CrossEntropy is a fundamental loss function for classification. This page covers the mathematical foundation (MLE & information theory), gradients, and worked examples.
🎯 Learning Objectives
- Understand why CrossEntropy arises from MLE
- Use CrossEntropy consistently with hard and soft labels
- Recall the key gradient result $\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$
- Apply CrossEntropy to real classification problems
Mathematical Foundation
CrossEntropy measures the discrepancy between the target distribution $y$ and the predicted distribution $\hat{y}$.
Binary form, where $y\in[0,1]$ (hard labels use $y\in\{0,1\}$):
$$\mathcal{L}(y,\hat{y}) = -\big[ y\,\log\hat{y} + (1-y)\,\log(1-\hat{y}) \big]$$
Multi-class form over $C$ classes:
$$\mathcal{L}(y, \hat{y}) = - \sum_{i=1}^{C} y_i\,\log \hat{y}_i$$
where $y$ is the target distribution (hard or soft) and $\hat{y}$ is the predicted distribution (e.g., softmax output).
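As a quick sanity check, here is a minimal NumPy sketch of both formulas; the helper names (`binary_cross_entropy`, `cross_entropy`) and the clipping constant are illustrative choices, not part of any particular library.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary CE: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class CE: -sum_i y_i * log(y_hat_i)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat), axis=-1)

print(binary_cross_entropy(1.0, 0.8))                                  # -log(0.8) ≈ 0.223
print(cross_entropy(np.array([1, 0, 0]), np.array([0.8, 0.1, 0.1])))   # ≈ 0.223
```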
Theoretical Foundations
1. Maximum Likelihood Estimation (MLE) Perspective
Assume we have $N$ samples. The target for sample $n$ is a distribution $y^{(n)} = [y^{(n)}_1,\ldots,y^{(n)}_C]$ (hard one-hot or soft). The model predicts $\hat{y}^{(n)} = [\hat{y}^{(n)}_1,\ldots,\hat{y}^{(n)}_C]$ with $\sum_i \hat{y}^{(n)}_i = 1$.
$$L(\theta) = \prod_{n=1}^N \prod_{i=1}^C \big(\hat{y}^{(n)}_i\big)^{\,y^{(n)}_i}$$
Taking logs:
$$\log L(\theta) = \sum_{n=1}^N \sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i$$
Minimizing the negative log-likelihood (NLL) gives exactly the CrossEntropy summed over samples:
$$-\log L(\theta) = \sum_{n=1}^N \Big( -\sum_{i=1}^C y^{(n)}_i \log \hat{y}^{(n)}_i \Big) = \sum_{n=1}^N \mathcal{L}\big(y^{(n)}, \hat{y}^{(n)}\big)$$
Key insight: CrossEntropy is the negative log-likelihood whether $y$ is one-hot (hard) or a soft distribution.
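This equivalence is easy to check numerically on a hypothetical two-sample batch (one hard target, one soft target); the values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy batch: 2 samples, 3 classes (one hard target, one soft target).
y     = np.array([[1.0, 0.0, 0.0],
                  [0.7, 0.2, 0.1]])
y_hat = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.3, 0.1]])

# Likelihood L(theta) = prod_n prod_i y_hat[n, i] ** y[n, i]
likelihood = np.prod(y_hat ** y)

# Negative log-likelihood vs. per-sample CrossEntropy summed over the batch
nll    = -np.log(likelihood)
ce_sum = np.sum(-np.sum(y * np.log(y_hat), axis=1))

print(nll, ce_sum)   # both ≈ 1.052 -> NLL equals the summed CrossEntropy
```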
2. Information Theory Perspective
In information theory, CrossEntropy $H(y,\hat{y})$ represents the expected code length for data sampled from the true distribution $y$ when using a code optimized for the predicted distribution $\hat{y}$.
CrossEntropy is related to the entropy of the true distribution and the KL divergence as:
$$H(y,\hat{y}) = H(y) + D_{\mathrm{KL}}(y \,\|\, \hat{y})$$
Here:
- $H(y,\hat{y})$ is the cross-entropy between the true and predicted distributions.
- $H(y)$ measures the inherent uncertainty of the true distribution.
- $D_{\mathrm{KL}}(y \,\|\, \hat{y}) \geq 0$ and equals zero iff $y = \hat{y}$. It quantifies the inefficiency (extra code length) incurred when using $\hat{y}$ instead of $y$.
- Since $H(y)$ is constant w.r.t. model parameters, minimizing CrossEntropy is exactly equivalent to minimizing KL divergence.
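The decomposition above can be verified numerically; the two distributions below are illustrative, and the script simply re-implements the three quantities from their definitions.

```python
import numpy as np

y     = np.array([0.7, 0.2, 0.1])   # true distribution (illustrative)
y_hat = np.array([0.6, 0.3, 0.1])   # predicted distribution (illustrative)

cross_entropy = -np.sum(y * np.log(y_hat))      # H(y, y_hat)
entropy       = -np.sum(y * np.log(y))          # H(y)
kl            =  np.sum(y * np.log(y / y_hat))  # D_KL(y || y_hat)

print(cross_entropy, entropy + kl)   # both ≈ 0.829
print(kl >= 0)                       # True: KL is the non-negative "extra" cost
```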
3. Gradient Analysis
Let $\hat{y} = \mathrm{softmax}(z)$. The gradient of CrossEntropy w.r.t. the logits $z$ is:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$
This holds for both hard and soft labels.
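A minimal sketch that checks this result against central finite differences, assuming nothing beyond NumPy; the logits and target below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 0.5, -1.0])        # arbitrary logits (illustrative)
y = np.array([0.0, 1.0, 0.0])         # works the same for a soft y

analytic = softmax(z) - y             # the claimed gradient: y_hat - y

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(analytic)
print(numeric)   # closely matches the analytic gradient
```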
Understanding Hard vs. Soft Labels
Hard Labels (one-hot)
- Class 1: $y = [1, 0, 0]$
- Class 2: $y = [0, 1, 0]$
- Class 3: $y = [0, 0, 1]$
Soft Labels (probability distributions)
- $y = [0.7, 0.2, 0.1]$
- $y = [0.4, 0.3, 0.3]$
- $y = [0.95, 0.03, 0.02]$
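The same CE formula handles both cases. The short sketch below (illustrative values, hypothetical `cross_entropy` helper) shows that a one-hot label reduces the sum to $-\log \hat{y}_{\text{true}}$, while a soft label weights every class.

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.8, 0.1, 0.1])

# Hard label: only the true class contributes -> CE = -log(y_hat[true_class])
hard = np.array([1.0, 0.0, 0.0])
print(cross_entropy(hard, y_hat), -np.log(y_hat[0]))   # both ≈ 0.223

# Soft label: every class contributes, weighted by its target probability
soft = np.array([0.7, 0.2, 0.1])
print(cross_entropy(soft, y_hat))                      # ≈ 0.847
```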
Applications in Classification
- Image: object recognition, handwritten digits
- Text: sentiment analysis, spam detection
- Medical: diagnosis from images/symptoms
- Recommenders: preference/category prediction
- NLP: NER, POS tagging
Why CrossEntropy?
- Probabilistic: directly compares distributions $y$ vs. $\hat{y}$
- Useful gradients: strong updates when predictions are wrong (see the sketch after this list)
- Convex for logistic regression
- Information-theoretic meaning
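A tiny sketch of the "useful gradients" point: with softmax outputs the logit gradient is $\hat{y} - y$, so a confidently wrong prediction produces a much larger update than a nearly correct one. The probability vectors below are made up for illustration.

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])                    # true class is class 1

confidently_wrong = np.array([0.05, 0.90, 0.05])
nearly_correct    = np.array([0.90, 0.05, 0.05])

# With softmax outputs, the gradient w.r.t. the logits is simply y_hat - y.
print(confidently_wrong - y)   # [-0.95, 0.90, 0.05] -> large corrective update
print(nearly_correct - y)      # [-0.10, 0.05, 0.05] -> small update
```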
💻 Code Examples
NumPy, PyTorch, TensorFlow implementations
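As a starting point, here is a minimal sketch comparing a hand-written NumPy version against PyTorch's `torch.nn.functional.cross_entropy`, which expects raw logits and class indices and applies log-softmax internally. The logits and labels are made-up examples; a TensorFlow variant would follow the same pattern.

```python
import numpy as np
import torch
import torch.nn.functional as F

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])
labels = np.array([0, 1])                      # hard class indices

# NumPy: log-softmax + negative log-probability of the true class
def np_cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# PyTorch: F.cross_entropy combines log-softmax and NLL internally
torch_loss = F.cross_entropy(torch.tensor(logits), torch.tensor(labels))

print(np_cross_entropy(logits, labels))   # ≈ same value as below
print(torch_loss.item())
```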
📊 Visualizations
Loss curves and gradient behavior
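For example, a minimal matplotlib sketch (assuming matplotlib is available) of the loss $-\log(p)$ and its gradient magnitude $1/p$ as a function of the predicted probability of the true class:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 1.0, 200)          # predicted probability of the true class

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(p, -np.log(p))                  # loss blows up as p -> 0
ax1.set_xlabel("predicted probability of true class")
ax1.set_ylabel("CE loss  -log(p)")

ax2.plot(p, 1.0 / p)                     # |dL/dp| = 1/p: huge gradients when wrong
ax2.set_xlabel("predicted probability of true class")
ax2.set_ylabel("|dL/dp| = 1/p")

plt.tight_layout()
plt.show()
```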
🏋️ Exercises
Hands-on practice problems
Detailed Calculation Examples
Worked examples to see how CrossEntropy behaves.
Hard Label Examples
| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|---|---|---|---|---|
| 1 | $[1, 0, 0]$ | $[0.8, 0.1, 0.1]$ | $0.223$ | Good prediction |
| 2 | $[0, 1, 0]$ | $[0.3, 0.4, 0.3]$ | $0.916$ | Uncertain prediction |
| 3 | $[0, 0, 1]$ | $[0.9, 0.05, 0.05]$ | $2.996$ | Wrong prediction |
| 4 | $[1, 0, 0]$ | $[0.99, 0.005, 0.005]$ | $0.010$ | Excellent prediction |
| 5 | $[0, 1, 0]$ | $[0.33, 0.33, 0.34]$ | $1.109$ | Random prediction |
Soft Label Examples
| Sample | True Label ($y$) | Prediction ($\hat{y}$) | CrossEntropy | Interpretation |
|---|---|---|---|---|
| 6 | $[0.7, 0.2, 0.1]$ | $[0.6, 0.3, 0.1]$ | $0.829$ | Close match |
| 7 | $[0.4, 0.3, 0.3]$ | $[0.8, 0.1, 0.1]$ | $1.471$ | Overconfident |
| 8 | $[0.95, 0.03, 0.02]$ | $[0.93, 0.04, 0.03]$ | $0.236$ | Excellent match |
Worked calculation for Sample 1 (hard label):
$$\mathcal{L}(y,\hat{y}) = -[1\cdot\log(0.8) + 0\cdot\log(0.1) + 0\cdot\log(0.1)] = -\log(0.8) = 0.223.$$
Worked calculation for Sample 6 (soft label):
$$\mathcal{L}(y,\hat{y}) = -[0.7\log(0.6) + 0.2\log(0.3) + 0.1\log(0.1)] = -[0.7(-0.511) + 0.2(-1.204) + 0.1(-2.303)] = 0.829.$$
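These rows can be reproduced with a few lines of NumPy; the `cross_entropy` helper below is a hypothetical re-implementation of the formula above, not a library function.

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(np.asarray(y) * np.log(np.asarray(y_hat)))

samples = {
    1: ([1, 0, 0],          [0.8, 0.1, 0.1]),
    5: ([0, 1, 0],          [0.33, 0.33, 0.34]),
    6: ([0.7, 0.2, 0.1],    [0.6, 0.3, 0.1]),
    7: ([0.4, 0.3, 0.3],    [0.8, 0.1, 0.1]),
    8: ([0.95, 0.03, 0.02], [0.93, 0.04, 0.03]),
}

for n, (y, y_hat) in samples.items():
    print(f"Sample {n}: CE = {cross_entropy(y, y_hat):.3f}")
# Sample 1: 0.223, 5: 1.109, 6: 0.829, 7: 1.471, 8: 0.236
```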
Key Observations
- Perfect prediction: If $y=\hat{y}$, CE equals the entropy $H(y)$ (see the sketch after this list)
- Confident & wrong: CE grows quickly when the model is confidently wrong
- Uncertain: Near-uniform predictions give moderate loss
- Soft labels: Naturally handled by the same CE formula
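A short check of the first observation, and of the fact that $H(y)$ is the floor of the loss for soft labels, using an illustrative target distribution:

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])
entropy = -np.sum(y * np.log(y))                    # H(y) ≈ 0.802

for y_hat in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]):
    ce = -np.sum(y * np.log(np.array(y_hat)))
    print(f"CE = {ce:.3f}  (H(y) = {entropy:.3f})")
# CE is minimized, and equal to H(y), exactly when y_hat == y
```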