Logistic Regression
Overview
Logistic regression is a fundamental classification algorithm that models the probability of a binary outcome. Despite the "regression" in its name, it performs classification: a linear combination of the features is passed through the logistic (sigmoid) function to produce a probability.
Learning Objectives
- Understand the mathematical foundation of logistic regression
- Derive the sigmoid function and decision boundary
- Implement logistic regression from the maximum likelihood (MLE) perspective
- Apply logistic regression to binary classification problems
- Compare with linear regression for classification tasks
Mathematical Foundation
The Sigmoid Function
The core of logistic regression is the sigmoid (logistic) function, which maps any real number to a probability between 0 and 1:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Key properties of the sigmoid function:
- Range: σ(z) ∈ (0, 1) for all real z
- Symmetry: σ(-z) = 1 - σ(z)
- Derivative: σ'(z) = σ(z)(1 - σ(z))
- Asymptotes: lim(z→∞) σ(z) = 1, lim(z→-∞) σ(z) = 0
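A minimal NumPy sketch of the sigmoid, split into two branches for numerical stability (an implementation detail, not part of the definition above), with quick checks of the range and symmetry properties:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) cannot overflow; for z < 0, rewrite using exp(z) instead.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
    return out

z = np.linspace(-10.0, 10.0, 101)
s = sigmoid(z)
assert np.all((s > 0) & (s < 1))           # range: sigma(z) in (0, 1)
assert np.allclose(sigmoid(-z), 1.0 - s)   # symmetry: sigma(-z) = 1 - sigma(z)
print(sigmoid(np.array([0.0, 2.0])))       # [0.5, ~0.881]
```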
Logistic Regression Model
For binary classification, we model the probability of class 1:
$$P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{x}^T \boldsymbol{w} + w_0) = \frac{1}{1 + e^{-(\boldsymbol{x}^T \boldsymbol{w} + w_0)}}$$
Where:
- $\boldsymbol{x}$ is the feature vector
- $\boldsymbol{w}$ is the weight vector
- $w_0$ is the bias term
- $P(y=1|\boldsymbol{x})$ is the probability of class 1
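A small sketch of the forward computation; the weights `w`, bias `w0`, and feature vector `x` below are made-up values, just to show how the score and probability are produced:

```python
import numpy as np

w = np.array([0.8, -0.4])   # weight vector w (made-up values)
w0 = 0.1                    # bias term w_0
x = np.array([2.0, 1.5])    # feature vector x

z = x @ w + w0                       # linear score x^T w + w_0 = 1.1
p_class1 = 1.0 / (1.0 + np.exp(-z))  # P(y = 1 | x) = sigma(z) ~ 0.75
y_hat = int(p_class1 >= 0.5)         # class prediction at the usual 0.5 threshold
print(z, p_class1, y_hat)
```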
Decision Boundary
The decision boundary occurs when the probability equals 0.5:
$$P(y=1|\boldsymbol{x}) = 0.5 \iff \boldsymbol{x}^T \boldsymbol{w} + w_0 = 0$$
This defines a hyperplane that separates the two classes.
Theoretical Foundation: Maximum Likelihood Estimation
Assumption
We assume that $y_i$ follows a Bernoulli distribution:
$$y_i \mid \boldsymbol{x}_i \sim \text{Bernoulli}(p_i)$$
Where $p_i = \sigma(\boldsymbol{x}_i^T \boldsymbol{w} + w_0)$.
Likelihood Function
The probability mass function for a Bernoulli random variable is:
$$P(y_i \mid \boldsymbol{x}_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
For all $n$ observations, the likelihood function is:
$$L(\boldsymbol{w}, w_0) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
Log-Likelihood
Taking the natural logarithm:
$$\ell(\boldsymbol{w}, w_0) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Cross-Entropy Loss
To minimize (instead of maximize), we use the negative log-likelihood:
$$\mathcal{L}(\boldsymbol{w}, w_0) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
This is exactly the binary cross-entropy loss function!
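A direct NumPy translation of this loss, here averaged over the observations rather than summed (the two differ only by a constant factor of $n$); the clipping constant is a common numerical guard, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y_true, p_pred))  # ~0.299
```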
Gradient Derivation
The gradient of the loss with respect to $\boldsymbol{w}$ is:
$$\nabla_{\boldsymbol{w}} \mathcal{L} = \sum_{i=1}^{n} (p_i - y_i)\, \boldsymbol{x}_i$$
Derivation: write $z_i = \boldsymbol{x}_i^T \boldsymbol{w} + w_0$ and apply the chain rule, using the sigmoid derivative $\frac{\partial p_i}{\partial z_i} = p_i(1 - p_i)$:
$$\nabla_{\boldsymbol{w}} \mathcal{L} = \sum_{i=1}^{n} \left( -\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i} \right) p_i (1 - p_i)\, \boldsymbol{x}_i = \sum_{i=1}^{n} (p_i - y_i)\, \boldsymbol{x}_i$$
The same argument gives $\frac{\partial \mathcal{L}}{\partial w_0} = \sum_{i=1}^{n} (p_i - y_i)$.
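Putting the loss and gradient together gives a from-scratch trainer. This is a minimal sketch: it averages the gradient over the batch (a $1/n$ rescaling of the sums above, which only changes the effective learning rate), uses plain full-batch gradient descent with a fixed step size, and runs on synthetic data generated just for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    """Full-batch gradient descent on the (averaged) binary cross-entropy loss."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + w0)      # p_i = sigma(x_i^T w + w_0)
        grad_w = X.T @ (p - y) / n   # (1/n) * sum_i (p_i - y_i) x_i
        grad_w0 = np.mean(p - y)     # (1/n) * sum_i (p_i - y_i)
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Synthetic two-feature data with noisy labels, purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
logits = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.5, size=200)
y = (logits > 0).astype(float)

w, w0 = fit_logistic_regression(X, y)
print("w:", w, "w0:", w0)
```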
Key Properties
Probabilistic Output
Provides probability estimates, not just class predictions.
Linear Decision Boundary
Creates linear separations in feature space.
Smooth Function
The sigmoid function is smooth and differentiable everywhere.
Interpretable
Coefficients can be interpreted as log-odds ratios.
Fast Training
Convex optimization problem: every local minimum is a global minimum.
No Distributional Assumptions
Makes no assumptions about how the features are distributed (unlike LDA).
Applications
- Healthcare: Disease diagnosis, treatment outcome prediction, medical image analysis
- Finance: Credit scoring, fraud detection, loan approval
- Marketing: Customer churn prediction, conversion analysis, A/B testing
- Engineering: Quality control, failure prediction, defect detection
- Social Sciences: Survey analysis, behavioral prediction, policy evaluation
Interactive Visualization
Explore the sigmoid function and decision boundary:
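A static matplotlib sketch of the same two views: the sigmoid curve with its 0.5 threshold, and the probability surface of a 2-D model whose weights (`w = [1.5, -1.0]`, `w0 = 0.5`) are hand-picked purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: the sigmoid curve with the 0.5 threshold at z = 0.
z = np.linspace(-8, 8, 400)
ax1.plot(z, sigmoid(z))
ax1.axhline(0.5, linestyle="--", color="gray")
ax1.axvline(0.0, linestyle="--", color="gray")
ax1.set(title="Sigmoid function", xlabel="z", ylabel=r"$\sigma(z)$")

# Right panel: probability surface of a 2-D model; the P = 0.5 contour is the
# linear decision boundary x^T w + w0 = 0.
w, w0 = np.array([1.5, -1.0]), 0.5
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
p = sigmoid(w[0] * xx + w[1] * yy + w0)
ax2.contourf(xx, yy, p, levels=20, cmap="RdBu", alpha=0.6)
ax2.contour(xx, yy, p, levels=[0.5], colors="k")
ax2.set(title="Decision boundary", xlabel="$x_1$", ylabel="$x_2$")

plt.tight_layout()
plt.show()
```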
Comparison with Linear Regression
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Problem Type | Regression (continuous output) | Classification (binary output) |
| Output Range | (-∞, +∞) | (0, 1) |
| Activation Function | Linear (identity) | Sigmoid |
| Loss Function | MSE | Cross-Entropy |
| Optimization | Closed-form solution available | Iterative optimization required |
| Assumptions | Gaussian errors | Bernoulli distribution |
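A toy scikit-learn sketch of the contrast (the 1-D data are invented for this example): fitting both models to 0/1 labels shows that linear regression's outputs can leave the unit interval, while logistic regression always returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented 1-D binary data.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [20.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-10.0], [20.0]])
print(lin.predict(X_new))               # unbounded scores, here below 0 and above 1
print(log.predict_proba(X_new)[:, 1])   # probabilities, always in (0, 1)
```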
Code Examples
NumPy, scikit-learn, and PyTorch implementations
Softmax Regression
Extension to multiclass classification
Exercises
Hands-on practice problems
Detailed Example: Email Spam Detection
Let's work through a practical example of detecting spam emails based on word frequencies.
Sample Data
"Free" Count | "Money" Count | Is Spam? | |
---|---|---|---|
1 | 3 | 1 | 1 (Spam) |
2 | 1 | 0 | 0 (Not Spam) |
3 | 5 | 2 | 1 (Spam) |
4 | 0 | 0 | 0 (Not Spam) |
5 | 2 | 1 | 1 (Spam) |
Model Setup
We want to predict spam probability based on word counts:
$$P(\text{spam} \mid \boldsymbol{x}) = \sigma\left(w_0 + w_1 \cdot x_{\text{free}} + w_2 \cdot x_{\text{money}}\right)$$
Training Process
Using gradient descent to minimize the cross-entropy loss, we repeatedly update the parameters:
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \, \nabla_{\boldsymbol{w}} \mathcal{L}, \qquad w_0 \leftarrow w_0 - \eta \, \frac{\partial \mathcal{L}}{\partial w_0}$$
where $\eta$ is the learning rate.
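A scikit-learn sketch on the five sample emails above. Note that scikit-learn adds L2 regularization by default (controlled by `C`), so the fitted coefficients will not exactly match the illustrative numbers used in the next subsection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample data from the table above: ["free" count, "money" count] -> spam label.
X = np.array([[3, 1], [1, 0], [5, 2], [0, 0], [2, 1]])
y = np.array([1, 0, 1, 0, 1])

model = LogisticRegression(C=10.0).fit(X, y)
print("bias w0:   ", model.intercept_[0])
print("weights w: ", model.coef_[0])
print("P(spam):   ", model.predict_proba(X)[:, 1].round(3))
```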
Interpretation
After training, we might get:
$$w_0 = -2.5, \qquad w_1 = 1.2, \qquad w_2 = 0.8$$
This means:
- Bias $w_0$ (-2.5): Base log-odds when no keywords present
- "Free" coefficient $w_1$ (1.2): Each "free" word increases log-odds by 1.2
- "Money" coefficient $w_2$ (0.8): Each "money" word increases log-odds by 0.8
Prediction Example
For an email with 2 "free" words and 1 "money" word:
So there's a 67% chance this email is spam.
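The same computation in a couple of lines of NumPy:

```python
import numpy as np

w0, w = -2.5, np.array([1.2, 0.8])   # coefficients from the example above
x = np.array([2, 1])                 # 2 "free" words, 1 "money" word

z = x @ w + w0                       # 1.2*2 + 0.8*1 - 2.5 = 0.7
p_spam = 1.0 / (1.0 + np.exp(-z))    # sigma(0.7)
print(round(z, 2), round(p_spam, 2)) # 0.7 0.67
```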