
Logistic Regression

📋 Overview

Logistic Regression is a fundamental classification algorithm that models the probability of a binary outcome. Despite the "regression" in its name, it is a classification method: it applies the logistic (sigmoid) function to a linear combination of the features to produce a probability.

🎯 Learning Objectives

  • Understand the mathematical foundation of logistic regression
  • Derive the sigmoid function and decision boundary
  • Implement logistic regression from the maximum likelihood estimation (MLE) perspective
  • Apply logistic regression to binary classification problems
  • Compare with linear regression for classification tasks
ā±ļø Estimated Time: 25–30 minutes reading + 50 minutes practice

Mathematical Foundation

⚠️ Important: Despite being called "regression," Logistic Regression is actually a classification algorithm for binary problems (0 or 1).

The Sigmoid Function

The core of logistic regression is the sigmoid (logistic) function, which maps any real number to a probability between 0 and 1:

Sigmoid Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Key properties of the sigmoid function (spot-checked numerically in the sketch after this list):

  • Range: σ(z) ∈ (0, 1) for all real z
  • Symmetry: σ(-z) = 1 - σ(z)
  • Derivative: σ'(z) = σ(z)(1 - σ(z))
  • Asymptotes: lim(z→∞) σ(z) = 1, lim(z→-∞) σ(z) = 0
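
As a quick numerical check of these properties, here is a minimal NumPy sketch; the helper name `sigmoid` and the test points are illustrative, not from the original page:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Symmetry: sigma(-z) = 1 - sigma(z)
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))        # True

# Derivative: sigma'(z) = sigma(z) * (1 - sigma(z)), checked by central differences
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.allclose(numeric, analytic, atol=1e-6))       # True
```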

Logistic Regression Model

For binary classification, we model the probability of class 1:

Model:

$$P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{x}^T \boldsymbol{w} + w_0) = \frac{1}{1 + e^{-(\boldsymbol{x}^T \boldsymbol{w} + w_0)}}$$

Where:

  • $\boldsymbol{x}$ is the feature vector
  • $\boldsymbol{w}$ is the weight vector
  • $w_0$ is the bias term
  • $P(y=1|\boldsymbol{x})$ is the probability of class 1

Decision Boundary

The decision boundary occurs when the probability equals 0.5:

$$P(y=1|\boldsymbol{x}) = 0.5 \Rightarrow \boldsymbol{x}^T \boldsymbol{w} + w_0 = 0$$

This defines a hyperplane that separates the two classes.
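
A small sketch of the model and decision rule, using made-up weights `w`, bias `w0`, and points `X` (none of these values come from the text): thresholding the probability at 0.5 gives exactly the same predictions as thresholding the linear score $\boldsymbol{x}^T \boldsymbol{w} + w_0$ at 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters for a 2-feature model (not from the text).
w = np.array([1.5, -2.0])
w0 = 0.5

X = np.array([[1.0, 0.2],
              [0.0, 1.0],
              [2.0, 2.0]])

z = X @ w + w0                           # linear score for each point
p = sigmoid(z)                           # P(y = 1 | x)

pred_from_prob = (p >= 0.5).astype(int)
pred_from_score = (z >= 0).astype(int)   # same rule: the boundary is z = 0
print(p)
print(np.array_equal(pred_from_prob, pred_from_score))   # True
```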

Theoretical Foundation: Maximum Likelihood Estimation

Assumption

We assume that $y_i$ follows a Bernoulli distribution:

$$y_i | \boldsymbol{x}_i \sim \text{Bernoulli}(p_i)$$

Where $p_i = \sigma(\boldsymbol{x}_i^T \boldsymbol{w} + w_0)$.

Likelihood Function

The probability mass function for a Bernoulli random variable is:

$$P(y_i | \boldsymbol{x}_i, \boldsymbol{w}) = p_i^{y_i}(1-p_i)^{1-y_i}$$

For all $n$ observations, the likelihood function is:

$$L(\boldsymbol{w}) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$$

Log-Likelihood

Taking the natural logarithm:

$$\ell(\boldsymbol{w}) = \sum_{i=1}^{n} [y_i \log(p_i) + (1-y_i) \log(1-p_i)]$$

Cross-Entropy Loss

To minimize (instead of maximize), we use the negative log-likelihood:

$$J(\boldsymbol{w}) = -\frac{1}{n}\sum_{i=1}^{n} [y_i \log(p_i) + (1-y_i) \log(1-p_i)]$$

This is exactly the binary cross-entropy loss function!
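
A minimal sketch confirming that the average negative log-likelihood equals the binary cross-entropy. The labels and probabilities below are made up, and `sklearn.metrics.log_loss` is used only as a reference implementation of the same formula:

```python
import numpy as np
from sklearn.metrics import log_loss   # reference cross-entropy implementation

# Illustrative labels and predicted probabilities (not from the text).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Average negative log-likelihood of the Bernoulli model ...
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# ... is the binary cross-entropy loss.
print(nll, log_loss(y, p))   # the two values match (up to numerical precision)
```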

Gradient Derivation

The gradient of the loss with respect to $\boldsymbol{w}$ is:

$$\frac{\partial J}{\partial \boldsymbol{w}} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)\boldsymbol{x}_i$$

Derivation: differentiating the loss term by term,

$$\frac{\partial J}{\partial \boldsymbol{w}} = -\frac{1}{n}\sum_{i=1}^{n} \left[\frac{y_i}{p_i} \frac{\partial p_i}{\partial \boldsymbol{w}} + \frac{1-y_i}{1-p_i} \frac{\partial (1-p_i)}{\partial \boldsymbol{w}}\right] = -\frac{1}{n}\sum_{i=1}^{n} \left[\frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\right] \frac{\partial p_i}{\partial \boldsymbol{w}},$$

since $\frac{\partial (1-p_i)}{\partial \boldsymbol{w}} = -\frac{\partial p_i}{\partial \boldsymbol{w}}$. Combining the bracketed fractions and substituting the sigmoid derivative $\frac{\partial p_i}{\partial \boldsymbol{w}} = p_i(1-p_i)\boldsymbol{x}_i$ gives

$$\frac{\partial J}{\partial \boldsymbol{w}} = -\frac{1}{n}\sum_{i=1}^{n} \frac{y_i - p_i}{p_i(1-p_i)} \cdot p_i(1-p_i) \boldsymbol{x}_i = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)\boldsymbol{x}_i$$
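
A hedged NumPy sketch of this result: it evaluates the cross-entropy loss and the analytic gradient $\frac{1}{n}\sum_i (p_i - y_i)\boldsymbol{x}_i$, then compares one gradient coordinate against a finite-difference approximation. The data, the helper names (`sigmoid`, `loss_and_grad`), and folding the bias into the weight vector via a column of ones are illustrative choices, not from the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, X, y):
    """Mean binary cross-entropy and its gradient, (1/n) * X^T (p - y)."""
    p = sigmoid(X @ w)
    eps = 1e-12                       # guard against log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# Toy data; the last column of ones absorbs the bias term w0.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.zeros(3)

loss, grad = loss_and_grad(w, X, y)

# Finite-difference check of the first gradient coordinate.
h = 1e-6
w_plus = w.copy()
w_plus[0] += h
numeric = (loss_and_grad(w_plus, X, y)[0] - loss) / h
print(grad[0], numeric)   # the two numbers should nearly match
```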

Key Properties

📊 Probabilistic Output

Provides probability estimates, not just class predictions.

🎯 Linear Decision Boundary

Creates linear separations in feature space.

📈 Smooth Function

Sigmoid function is smooth and differentiable everywhere.

šŸ” Interpretable

Coefficients can be interpreted as log-odds ratios.

⚡ Fast Training

The cross-entropy loss is convex, so there are no spurious local minima and gradient-based solvers train quickly.

🚫 No Distributional Assumptions

Makes no assumptions about how the features are distributed (unlike LDA); it only assumes the log-odds are linear in the features.

Applications

  • Healthcare: Disease diagnosis, treatment outcome prediction, medical image analysis
  • Finance: Credit scoring, fraud detection, loan approval
  • Marketing: Customer churn prediction, conversion analysis, A/B testing
  • Engineering: Quality control, failure prediction, defect detection
  • Social Sciences: Survey analysis, behavioral prediction, policy evaluation

Comparison with Linear Regression

| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Problem Type | Regression (continuous output) | Classification (binary output) |
| Output Range | (-∞, +∞) | (0, 1) |
| Activation Function | Linear (identity) | Sigmoid |
| Loss Function | MSE | Cross-Entropy |
| Optimization | Closed-form solution available | Iterative optimization required |
| Assumptions | Gaussian errors | Bernoulli distribution |
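
To illustrate the output-range row of this table, here is a sketch using scikit-learn on a synthetic dataset (the dataset, sizes, and random seed are arbitrary choices): the linear model's raw predictions can leave [0, 1], while the logistic model's `predict_proba` output cannot.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

lin = LinearRegression().fit(X, y)                   # MSE, closed-form solution
log = LogisticRegression(max_iter=1000).fit(X, y)    # cross-entropy, iterative solver

lin_out = lin.predict(X)                  # unbounded real values
log_out = log.predict_proba(X)[:, 1]      # probabilities in (0, 1)

print(lin_out.min(), lin_out.max())       # may fall outside [0, 1]
print(log_out.min(), log_out.max())       # always inside (0, 1)
```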

💻 Code Examples

NumPy, scikit-learn, and PyTorch implementations

📊 Softmax Regression

Extension to multiclass classification

šŸ‹ļø Exercises

Hands-on practice problems

Detailed Example: Email Spam Detection

Let's work through a practical example of detecting spam emails based on word frequencies.

Sample Data

Email "Free" Count "Money" Count Is Spam?
1 3 1 1 (Spam)
2 1 0 0 (Not Spam)
3 5 2 1 (Spam)
4 0 0 0 (Not Spam)
5 2 1 1 (Spam)

Model Setup

We want to predict spam probability based on word counts:

$$P(\text{Spam}|\boldsymbol{x}) = \sigma(w_0 + w_1 \times \text{"Free"} + w_2 \times \text{"Money"})$$

Training Process

Using gradient descent to minimize cross-entropy loss:

$$J(\boldsymbol{w}) = -\frac{1}{n}\sum_{i=1}^{n} [y_i \log(p_i) + (1-y_i) \log(1-p_i)]$$
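
A rough sketch of that training loop on the five sample emails above. The learning rate and iteration count are arbitrary, the bias is folded in as a trailing column of ones, and because this tiny dataset is linearly separable the learned weights keep growing with more iterations rather than matching the illustrative values quoted below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Columns: "Free" count, "Money" count; rows follow the sample data table.
X = np.array([[3, 1], [1, 0], [5, 2], [0, 0], [2, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 1], dtype=float)

# Append a column of ones so the bias w0 is learned as the last weight.
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(3)
lr = 0.1                                   # arbitrary learning rate

for _ in range(5000):                      # plain batch gradient descent
    p = sigmoid(Xb @ w)
    grad = Xb.T @ (p - y) / len(y)         # gradient of the cross-entropy loss
    w -= lr * grad

print(w)                                   # learned [w1, w2, w0]
print(sigmoid(Xb @ w).round(2))            # fitted spam probabilities
```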

Interpretation

After training, we might get:

$$\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} -2.5 \\ 1.2 \\ 0.8 \end{bmatrix}$$

This means:

  • Bias $w_0$ (-2.5): Base log-odds when no keywords present
  • "Free" coefficient $w_1$ (1.2): Each "free" word increases log-odds by 1.2
  • "Money" coefficient $w_2$ (0.8): Each "money" word increases log-odds by 0.8

Prediction Example

For an email with 2 "free" words and 1 "money" word:

$$z = -2.5 + 1.2 \times 2 + 0.8 \times 1 = 0.7$$
$$P(\text{Spam}) = \sigma(0.7) = \frac{1}{1 + e^{-0.7}} \approx 0.67$$

So there is roughly a 67% chance that this email is spam.
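
The same arithmetic in a few lines of Python, using the weights and word counts from this example:

```python
import numpy as np

w0, w1, w2 = -2.5, 1.2, 0.8
free_count, money_count = 2, 1

z = w0 + w1 * free_count + w2 * money_count
p_spam = 1.0 / (1.0 + np.exp(-z))
print(round(z, 2), round(p_spam, 2))   # 0.7 0.67
```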