Logistic Regression
Overview
Logistic regression is a fundamental classification algorithm that models the probability of a binary outcome. Despite the "regression" in its name, it performs classification: a linear combination of the features is passed through the logistic (sigmoid) function to produce a probability.
Learning Objectives
- Understand the mathematical foundation of logistic regression
- Derive the sigmoid function and decision boundary
- Implement logistic regression from the maximum likelihood (MLE) perspective
- Apply logistic regression to binary classification problems
- Compare with linear regression for classification tasks
Mathematical Foundation
The Sigmoid Function
The core of logistic regression is the sigmoid (logistic) function, which maps any real number to a probability between 0 and 1:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Key properties of the sigmoid function:
- Range: σ(z) ∈ (0, 1) for all real z
- Symmetry: σ(-z) = 1 - σ(z)
- Derivative: σ'(z) = σ(z)(1 - σ(z))
- Asymptotes: lim(z→∞) σ(z) = 1, lim(z→-∞) σ(z) = 0
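A minimal NumPy sketch of the sigmoid, split into two branches for numerical stability (an implementation detail, not part of the definition above), with quick checks of the range and symmetry properties:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) cannot overflow; for z < 0, rewrite using exp(z) instead.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
    return out

z = np.linspace(-10.0, 10.0, 101)
s = sigmoid(z)
assert np.all((s > 0) & (s < 1))           # range: sigma(z) in (0, 1)
assert np.allclose(sigmoid(-z), 1.0 - s)   # symmetry: sigma(-z) = 1 - sigma(z)
print(sigmoid(np.array([0.0, 2.0])))       # [0.5, ~0.881]
```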
Logistic Regression Model
For binary classification, we model the probability of class 1:
$$P(y=1|\boldsymbol{x}) = \sigma(\boldsymbol{x}^T \boldsymbol{w} + w_0) = \frac{1}{1 + e^{-(\boldsymbol{x}^T \boldsymbol{w} + w_0)}}$$
Where:
- $\boldsymbol{x}$ is the feature vector
- $\boldsymbol{w}$ is the weight vector
- $w_0$ is the bias term
- $P(y=1|\boldsymbol{x})$ is the probability of class 1
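A small sketch of the forward computation; the weights `w`, bias `w0`, and feature vector `x` below are made-up values, just to show how the score and probability are produced:

```python
import numpy as np

w = np.array([0.8, -0.4])   # weight vector w (made-up values)
w0 = 0.1                    # bias term w_0
x = np.array([2.0, 1.5])    # feature vector x

z = x @ w + w0                       # linear score x^T w + w_0 = 1.1
p_class1 = 1.0 / (1.0 + np.exp(-z))  # P(y = 1 | x) = sigma(z) ~ 0.75
y_hat = int(p_class1 >= 0.5)         # class prediction at the usual 0.5 threshold
print(z, p_class1, y_hat)
```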
Decision Boundary
The decision boundary occurs when the probability equals 0.5:
$$P(y=1|\boldsymbol{x}) = 0.5 \iff \boldsymbol{x}^T \boldsymbol{w} + w_0 = 0$$
This defines a hyperplane that separates the two classes.
Theoretical Foundation: Maximum Likelihood Estimation
Assumption
We assume that $y_i$ follows a Bernoulli distribution:
$$y_i \mid \boldsymbol{x}_i \sim \text{Bernoulli}(p_i)$$
Where $p_i = \sigma(\boldsymbol{x}_i^T \boldsymbol{w} + w_0)$.
Likelihood Function
The probability mass function for a Bernoulli random variable is:
$$P(y_i \mid \boldsymbol{x}_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
For all $n$ observations, the likelihood function is:
$$L(\boldsymbol{w}, w_0) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
Log-Likelihood
Taking the natural logarithm:
$$\ell(\boldsymbol{w}, w_0) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Cross-Entropy Loss
To minimize (instead of maximize), we use the negative log-likelihood:
$$\mathcal{L}(\boldsymbol{w}, w_0) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
This is exactly the binary cross-entropy loss function!
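A direct NumPy translation of this loss, here averaged over the observations rather than summed (the two differ only by a constant factor of $n$); the clipping constant is a common numerical guard, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y_true, p_pred))  # ~0.299
```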
Gradient Derivation
The gradient of the loss with respect to $\boldsymbol{w}$ is:
$$\nabla_{\boldsymbol{w}} \mathcal{L} = \sum_{i=1}^{n} (p_i - y_i)\, \boldsymbol{x}_i$$
Derivation: write $z_i = \boldsymbol{x}_i^T \boldsymbol{w} + w_0$ and apply the chain rule, using the sigmoid derivative $\frac{\partial p_i}{\partial z_i} = p_i(1 - p_i)$:
$$\nabla_{\boldsymbol{w}} \mathcal{L} = \sum_{i=1}^{n} \left( -\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i} \right) p_i (1 - p_i)\, \boldsymbol{x}_i = \sum_{i=1}^{n} (p_i - y_i)\, \boldsymbol{x}_i$$
The same argument gives $\frac{\partial \mathcal{L}}{\partial w_0} = \sum_{i=1}^{n} (p_i - y_i)$.
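Putting the loss and gradient together gives a from-scratch trainer. This is a minimal sketch: it averages the gradient over the batch (a $1/n$ rescaling of the sums above, which only changes the effective learning rate), uses plain full-batch gradient descent with a fixed step size, and runs on synthetic data generated just for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    """Full-batch gradient descent on the (averaged) binary cross-entropy loss."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + w0)      # p_i = sigma(x_i^T w + w_0)
        grad_w = X.T @ (p - y) / n   # (1/n) * sum_i (p_i - y_i) x_i
        grad_w0 = np.mean(p - y)     # (1/n) * sum_i (p_i - y_i)
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Synthetic two-feature data with noisy labels, purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
logits = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.5, size=200)
y = (logits > 0).astype(float)

w, w0 = fit_logistic_regression(X, y)
print("w:", w, "w0:", w0)
```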
Key Properties
Probabilistic Output
Provides probability estimates, not just class predictions.
Linear Decision Boundary
Creates linear separations in feature space.
Smooth Function
The sigmoid function is smooth and differentiable everywhere.
Interpretable
Coefficients can be interpreted as log-odds ratios.
Fast Training
Convex optimization problem: every local minimum is a global minimum.
No Distributional Assumptions
Makes no assumptions about how the features are distributed (unlike LDA).
Applications
- Healthcare: Disease diagnosis, treatment outcome prediction, medical image analysis
- Finance: Credit scoring, fraud detection, loan approval
- Marketing: Customer churn prediction, conversion analysis, A/B testing
- Engineering: Quality control, failure prediction, defect detection
- Social Sciences: Survey analysis, behavioral prediction, policy evaluation
Interactive Visualization
Explore the sigmoid function and decision boundary:
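A static matplotlib sketch of the same two views: the sigmoid curve with its 0.5 threshold, and the probability surface of a 2-D model whose weights (`w = [1.5, -1.0]`, `w0 = 0.5`) are hand-picked purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: the sigmoid curve with the 0.5 threshold at z = 0.
z = np.linspace(-8, 8, 400)
ax1.plot(z, sigmoid(z))
ax1.axhline(0.5, linestyle="--", color="gray")
ax1.axvline(0.0, linestyle="--", color="gray")
ax1.set(title="Sigmoid function", xlabel="z", ylabel=r"$\sigma(z)$")

# Right panel: probability surface of a 2-D model; the P = 0.5 contour is the
# linear decision boundary x^T w + w0 = 0.
w, w0 = np.array([1.5, -1.0]), 0.5
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
p = sigmoid(w[0] * xx + w[1] * yy + w0)
ax2.contourf(xx, yy, p, levels=20, cmap="RdBu", alpha=0.6)
ax2.contour(xx, yy, p, levels=[0.5], colors="k")
ax2.set(title="Decision boundary", xlabel="$x_1$", ylabel="$x_2$")

plt.tight_layout()
plt.show()
```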
Comparison with Linear Regression
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Problem Type | Regression (continuous output) | Classification (binary output) |
| Output Range | (-∞, +∞) | (0, 1) |
| Activation Function | Linear (identity) | Sigmoid |
| Loss Function | MSE | Cross-Entropy |
| Optimization | Closed-form solution available | Iterative optimization required |
| Assumptions | Gaussian errors | Bernoulli distribution |
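A toy scikit-learn sketch of the contrast (the 1-D data are invented for this example): fitting both models to 0/1 labels shows that linear regression's outputs can leave the unit interval, while logistic regression always returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented 1-D binary data.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [20.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-10.0], [20.0]])
print(lin.predict(X_new))               # unbounded scores, here below 0 and above 1
print(log.predict_proba(X_new)[:, 1])   # probabilities, always in (0, 1)
```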
Code Examples
NumPy, scikit-learn, and PyTorch implementations
Softmax Regression
Extension to multiclass classification
Exercises
Hands-on practice problems
Detailed Example: Email Spam Detection
Let's work through a practical example of detecting spam emails based on word frequencies.
Sample Data
"Free" Count | "Money" Count | Is Spam? | |
---|---|---|---|
1 | 3 | 1 | 1 (Spam) |
2 | 1 | 0 | 0 (Not Spam) |
3 | 5 | 2 | 1 (Spam) |
4 | 0 | 0 | 0 (Not Spam) |
5 | 2 | 1 | 1 (Spam) |
Model Setup
We want to predict spam probability based on word counts:
$$P(\text{spam} \mid \boldsymbol{x}) = \sigma\left(w_0 + w_1 \cdot x_{\text{free}} + w_2 \cdot x_{\text{money}}\right)$$
Training Process
Using gradient descent to minimize the cross-entropy loss, we repeatedly update the parameters:
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \, \nabla_{\boldsymbol{w}} \mathcal{L}, \qquad w_0 \leftarrow w_0 - \eta \, \frac{\partial \mathcal{L}}{\partial w_0}$$
where $\eta$ is the learning rate.
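A scikit-learn sketch on the five sample emails above. Note that scikit-learn adds L2 regularization by default (controlled by `C`), so the fitted coefficients will not exactly match the illustrative numbers used in the next subsection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample data from the table above: ["free" count, "money" count] -> spam label.
X = np.array([[3, 1], [1, 0], [5, 2], [0, 0], [2, 1]])
y = np.array([1, 0, 1, 0, 1])

model = LogisticRegression(C=10.0).fit(X, y)
print("bias w0:   ", model.intercept_[0])
print("weights w: ", model.coef_[0])
print("P(spam):   ", model.predict_proba(X)[:, 1].round(3))
```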
Interpretation
After training, we might get:
$$w_0 = -2.5, \qquad w_1 = 1.2, \qquad w_2 = 0.8$$
This means:
- Bias $w_0$ (-2.5): Base log-odds when no keywords present
- "Free" coefficient $w_1$ (1.2): Each "free" word increases log-odds by 1.2
- "Money" coefficient $w_2$ (0.8): Each "money" word increases log-odds by 0.8
Prediction Example
For an email with 2 "free" words and 1 "money" word:
So there's a 67% chance this email is spam.
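The same computation in a couple of lines of NumPy:

```python
import numpy as np

w0, w = -2.5, np.array([1.2, 0.8])   # coefficients from the example above
x = np.array([2, 1])                 # 2 "free" words, 1 "money" word

z = x @ w + w0                       # 1.2*2 + 0.8*1 - 2.5 = 0.7
p_spam = 1.0 / (1.0 + np.exp(-z))    # sigma(0.7)
print(round(z, 2), round(p_spam, 2)) # 0.7 0.67
```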