Softmax Regression
Overview
Softmax Regression (also called Multinomial Logistic Regression) extends logistic regression to multiclass classification problems. It uses the softmax function to convert raw scores (logits) into probability distributions over multiple classes.
Learning Objectives
- Understand the mathematical foundation of softmax regression
- Derive the softmax function and its properties
- Formulate and implement softmax regression from the maximum likelihood estimation (MLE) perspective
- Apply softmax regression to multiclass problems
- Compare with binary logistic regression
Mathematical Foundation
The Softmax Function
The softmax function converts a vector of K real numbers into a probability distribution over K classes:
$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, 2, ..., K$$
Key properties of the softmax function:
- Normalization: $\sum_{i=1}^{K} \sigma(z_i) = 1$
- Range: $\sigma(z_i) \in (0, 1)$ for all $i$
- Argmax Property: $\arg\max_i z_i = \arg\max_i \sigma(z_i)$
- Differentiable: Smooth everywhere
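As a quick sanity check of these properties, here is a minimal NumPy sketch of the softmax function; the max-subtraction is an assumed numerical-stability trick, not part of the definition (the shift cancels in the ratio and leaves the output unchanged):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())        # shifting by max(z) cancels in the ratio
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)                               # approx. [0.659 0.242 0.099]
print(p.sum())                         # 1.0 (normalization)
print(np.argmax(z) == np.argmax(p))    # True (argmax preservation)
```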
Softmax Regression Model
For multiclass classification with K classes, we model the probability of each class:
$$P(y=k|\boldsymbol{x}) = \frac{e^{\boldsymbol{w}_k^T \boldsymbol{x} + b_k}}{\sum_{j=1}^{K} e^{\boldsymbol{w}_j^T \boldsymbol{x} + b_j}}$$
Where:
- $\boldsymbol{x}$ is the feature vector
- $\boldsymbol{w}_k$ is the weight vector for class k
- $b_k$ is the bias term for class k
- $P(y=k|\boldsymbol{x})$ is the probability of class k
Matrix Form
We can write this more compactly using matrix notation:
$$\boldsymbol{z} = \boldsymbol{W}^T \boldsymbol{x} + \boldsymbol{b}$$
$$\boldsymbol{p} = \text{softmax}(\boldsymbol{z})$$
Where:
- $\boldsymbol{W} \in \mathbb{R}^{d \times K}$ is the weight matrix
- $\boldsymbol{b} \in \mathbb{R}^K$ is the bias vector
- $\boldsymbol{z} \in \mathbb{R}^K$ is the logits vector
- $\boldsymbol{p} \in \mathbb{R}^K$ is the probability vector
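A minimal NumPy sketch of this forward pass, assuming random illustrative values for $\boldsymbol{W}$, $\boldsymbol{b}$, and $\boldsymbol{x}$ rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3                        # feature dimension, number of classes

W = rng.normal(size=(d, K))        # W in R^{d x K}
b = np.zeros(K)                    # b in R^K
x = rng.normal(size=d)             # feature vector x in R^d

z = W.T @ x + b                    # logits z in R^K
p = np.exp(z - z.max())
p /= p.sum()                       # probability vector p in R^K

print(z)                           # raw scores (logits)
print(p, p.sum())                  # probabilities summing to 1
```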
Theoretical Foundation: Maximum Likelihood Estimation
Assumption
We assume that $y_i$ follows a categorical distribution:
$$y_i \sim \text{Categorical}(\boldsymbol{p}_i)$$
Where $\boldsymbol{p}_i = \text{softmax}(\boldsymbol{W}^T \boldsymbol{x}_i + \boldsymbol{b})$.
Likelihood Function
For one-hot encoded labels, the probability mass function is:
$$P(\boldsymbol{y}_i|\boldsymbol{x}_i) = \prod_{k=1}^{K} p_{i,k}^{y_{i,k}}$$
For all $n$ observations, the likelihood function is:
$$L(\boldsymbol{W}, \boldsymbol{b}) = \prod_{i=1}^{n} \prod_{k=1}^{K} p_{i,k}^{y_{i,k}}$$
Where $y_{i,k}$ is 1 if sample i belongs to class k, 0 otherwise (one-hot encoding).
Log-Likelihood
Taking the natural logarithm:
$$\ell(\boldsymbol{W}, \boldsymbol{b}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}$$
Cross-Entropy Loss
To minimize (instead of maximize), we use the negative log-likelihood:
$$J(\boldsymbol{W}, \boldsymbol{b}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}$$
This is exactly the categorical cross-entropy loss function!
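As an illustration (not a reference implementation), the loss for one-hot labels can be written in a few lines of NumPy; the small `eps` constant is an assumed guard against $\log 0$:

```python
import numpy as np

def categorical_cross_entropy(Y_onehot, P):
    """Negative log-likelihood summed over samples.

    Y_onehot: (n, K) one-hot labels; P: (n, K) predicted probabilities.
    """
    eps = 1e-12                            # assumed guard against log(0)
    return -np.sum(Y_onehot * np.log(P + eps))

Y = np.array([[1, 0, 0],
              [0, 0, 1]])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(Y, P))     # -(log 0.7 + log 0.6) ≈ 0.87
```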
Gradient Derivation
The gradient of the loss with respect to $\boldsymbol{w}_k$ is:
$$\frac{\partial J}{\partial \boldsymbol{w}_k} = \sum_{i=1}^{n} (p_{i,k} - y_{i,k}) \boldsymbol{x}_i$$
And with respect to $b_k$:
$$\frac{\partial J}{\partial b_k} = \sum_{i=1}^{n} (p_{i,k} - y_{i,k})$$
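The following NumPy sketch evaluates these gradients on synthetic data and compares one entry against a finite-difference estimate; all names, shapes, and random data are illustrative assumptions:

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def nll(W, b, X, Y):
    P = softmax_rows(X @ W + b)
    return -np.sum(Y * np.log(P))

rng = np.random.default_rng(1)
n, d, K = 5, 3, 4
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(K, size=n)]     # one-hot labels
W = rng.normal(size=(d, K))
b = np.zeros(K)

P = softmax_rows(X @ W + b)
grad_W = X.T @ (P - Y)                     # column k equals sum_i (p_ik - y_ik) x_i
grad_b = (P - Y).sum(axis=0)               # entry k equals sum_i (p_ik - y_ik)

# compare one entry of the analytic gradient with a finite difference
h = 1e-6
W_shift = W.copy()
W_shift[0, 0] += h
print(grad_W[0, 0], (nll(W_shift, b, X, Y) - nll(W, b, X, Y)) / h)
```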
Key Properties
Probability Distribution
Outputs a valid probability distribution over all classes.
Multiclass Support
Handles any number of classes K ≥ 2 naturally.
Smooth Function
The softmax function is smooth and differentiable everywhere.
Interpretable
Probabilities can be directly interpreted as confidence scores.
Fast Training
Convex optimization problem, so every local minimum is a global minimum.
No Distributional Assumptions
Makes no assumptions about the distribution of the features.
Applications
- Computer Vision: Image classification (CIFAR-10, ImageNet), object detection
- NLP: Text classification, sentiment analysis, language detection, topic modeling
- Healthcare: Disease classification, medical image analysis, drug discovery
- Finance: Risk rating, customer segmentation, fraud detection
- Engineering: Quality classification, fault diagnosis, system monitoring
Comparison: Binary vs Multiclass
| Aspect | Logistic Regression (Binary) | Softmax Regression (Multiclass) |
|---|---|---|
| Number of Classes | 2 (K = 2) | Multiple (K > 2) |
| Activation Function | Sigmoid $\sigma(z)$ | Softmax $\sigma(\boldsymbol{z})$ |
| Output Range | $P(y=1) \in (0, 1)$ | $P(y=k) \in (0, 1)$ with $\sum_{k} P(y=k) = 1$ |
| Parameters | $\boldsymbol{w} \in \mathbb{R}^d$, $b \in \mathbb{R}$ | $\boldsymbol{W} \in \mathbb{R}^{d \times K}$, $\boldsymbol{b} \in \mathbb{R}^K$ |
| Loss Function | Binary Cross-Entropy | Categorical Cross-Entropy |
| Decision Rule | $P(y=1) > 0.5$ | $\arg\max_k P(y=k)$ |
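In practice, a softmax regression of this form can be fit with scikit-learn's `LogisticRegression`; exactly how multiclass targets are handled depends on the library version, so the sketch below only assumes the common case where the default `lbfgs` solver uses the multinomial (softmax) formulation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)              # 3 classes, 4 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# With the default lbfgs solver, recent scikit-learn versions fit the
# multinomial (softmax) model for targets with more than two classes.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te[:3])
print(proba)                                    # each row sums to 1 over 3 classes
print(clf.score(X_te, y_te))                    # test accuracy
```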
Code Examples
NumPy, scikit-learn, and PyTorch implementations
Advanced Topics
One-vs-Rest and One-vs-One strategies
Exercises
Hands-on practice problems
Detailed Example: Handwritten Digit Classification
Let's work through a practical example of classifying handwritten digits (0-9) using softmax regression.
Problem Setup
We have 10 classes (digits 0-9) and want to predict the probability of each digit given pixel features.
Sample Prediction
For an input image $\boldsymbol{x}$, we compute logits for all 10 classes:
$$\boldsymbol{z} = \boldsymbol{W}^T \boldsymbol{x} + \boldsymbol{b} \in \mathbb{R}^{10}$$
Suppose we get logits: $\boldsymbol{z} = [2.1, -0.5, 0.8, 1.2, -1.1, 3.0, 0.3, -0.2, 1.5, 0.1]^T$
Softmax Calculation
First, compute the exponential of each logit:
$$e^{\boldsymbol{z}} \approx [8.17, 0.61, 2.23, 3.32, 0.33, 20.09, 1.35, 0.82, 4.48, 1.11]^T$$
Sum of exponentials: $\sum_{k=0}^{9} e^{z_k} = 8.17 + 0.61 + 2.23 + ... + 1.11 \approx 42.49$
Final probabilities (each exponential divided by the sum):
$$\boldsymbol{p} \approx [0.192, 0.014, 0.052, 0.078, 0.008, 0.473, 0.032, 0.019, 0.105, 0.026]^T$$
Prediction
The predicted class is the one with the highest probability:
$$\hat{y} = \arg\max_k P(y=k|\boldsymbol{x}) = 5$$
So the model predicts this is digit "5" with about 47.3% confidence; the largest logit, $z_5 = 3.0$, yields the largest probability, as guaranteed by the argmax property.
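These numbers can be reproduced with a few lines of NumPy (values rounded for display):

```python
import numpy as np

z = np.array([2.1, -0.5, 0.8, 1.2, -1.1, 3.0, 0.3, -0.2, 1.5, 0.1])
exp_z = np.exp(z)
p = exp_z / exp_z.sum()

print(exp_z.round(2))        # [ 8.17  0.61  2.23  3.32  0.33 20.09  1.35  0.82  4.48  1.11]
print(round(exp_z.sum(), 2)) # 42.49
print(p.round(3))            # largest entry is about 0.473
print(int(p.argmax()))       # 5 -> the model predicts digit 5
```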
Training Process
During training, we minimize the categorical cross-entropy loss:
$$J(\boldsymbol{W}, \boldsymbol{b}) = -\sum_{i=1}^{n} \sum_{k=0}^{9} y_{i,k} \log p_{i,k}$$
Where $y_{i,k}$ is 1 if sample i is digit k, 0 otherwise.
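A minimal sketch of such a training loop with batch gradient descent, assuming synthetic random data in place of real digit images, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 500, 64, 10                      # samples, pixel features (e.g. 8x8), classes
X = rng.normal(size=(n, d))                # synthetic "images" for illustration only
y = rng.integers(K, size=n)                # synthetic digit labels
Y = np.eye(K)[y]                           # one-hot encoding

W = np.zeros((d, K))
b = np.zeros(K)
lr = 0.1

for epoch in range(200):
    Z = X @ W + b
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)   # softmax probabilities, shape (n, K)

    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    grad_W = X.T @ (P - Y) / n             # averaged gradients
    grad_b = (P - Y).mean(axis=0)

    W -= lr * grad_W
    b -= lr * grad_b

pred = (X @ W + b).argmax(axis=1)
print(round(loss, 3), (pred == y).mean())  # final loss and training accuracy
```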