🎯 Naive Bayes Tutorial

An interactive tutorial on probabilistic classification using Bayes' theorem with the naive independence assumption

📊 Bayes' Theorem

Understanding the fundamental formula behind Naive Bayes

Mathematical Foundation

$$P(C|X) = \frac{P(X|C) \times P(C)}{P(X)}$$

$P(C|X)$ - Posterior

Probability of class C given sample X

$P(X|C)$ - Likelihood

Probability of observing X when class is C

$P(C)$ - Prior

Prior probability of class C

$P(X)$ - Evidence

Probability of the sample X (normalizing constant)
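The four quantities above combine in a single line of arithmetic. A minimal sketch (the function name and example values are illustrative):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

# Example: P(C) = 0.5, P(X|C) = 0.7, P(X) = 0.6
print(round(posterior(0.5, 0.7, 0.6), 2))  # 0.58
```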

Interactive Bayes Calculator

[Interactive widget: sliders set the prior, likelihood, and evidence; the posterior $P(C|X)$ updates live, e.g. showing 0.58 for the default settings.]

Real-world Example: Medical Diagnosis

Scenario: A patient tests positive for a disease. What's the probability they actually have the disease?

  • Prior $P(\text{Disease})$: 1% of population has this disease
  • Likelihood $P(\text{Positive}|\text{Disease})$: the test returns positive 95% of the time when the disease is present (sensitivity)
  • Evidence $P(\text{Positive})$: 5.94% of all tests are positive
$$P(\text{Disease}|\text{Positive}) = \frac{0.95 \times 0.01}{0.0594} = \mathbf{0.16}$$

Only 16% chance the patient actually has the disease!
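The calculation above can be reproduced directly from the three stated quantities:

```python
prior = 0.01        # P(Disease): 1% of the population has the disease
likelihood = 0.95   # P(Positive|Disease): test sensitivity
evidence = 0.0594   # P(Positive): fraction of all tests that come back positive

posterior = likelihood * prior / evidence
print(f"P(Disease|Positive) = {posterior:.2f}")  # 0.16
```

The low prior dominates: even a fairly accurate test yields mostly false positives when the disease is rare.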

🎯 Naive Bayes Classification

Applying Bayes' theorem to classification with independence assumption

Naive Bayes Network

Note: To complete the model description, we need to estimate P(C) and P(Xᵢ|C) for all d features.

Independence Assumption

$$P(X_1, X_2, \ldots, X_d | C) = \prod_{i=1}^{d} P(X_i | C)$$

The "Naive" Assumption: Features are conditionally independent given the class.

This means: once we know the class, observing one feature tells us nothing more about any other feature.
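A toy numerical check of what the assumption buys us: instead of estimating one d-dimensional joint distribution, we estimate d one-dimensional ones and multiply (the per-feature likelihoods below are illustrative):

```python
import math

# Illustrative per-feature likelihoods P(x_i | C) for one class
per_feature = [0.8, 0.5, 0.9]

# Under the naive assumption, the joint likelihood is just the product
joint = math.prod(per_feature)
print(round(joint, 2))  # 0.36
```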

Mathematical Formulation

Step 1: Apply Bayes' Theorem
$$P(C|\mathbf{x}) = \frac{P(C) \times P(\mathbf{x}|C)}{P(\mathbf{x})}$$

For classification, we need to find the most probable class given the feature vector.

↓ Apply Naive Independence Assumption
Step 2: Apply Naive Independence
$$P(\mathbf{x}|C) = \prod_{i=1}^{d} P(x_i|C)$$

Features are conditionally independent given the class, so joint probability becomes product of individual probabilities.

↓ Substitute into Bayes' theorem
Step 3: Combine and Simplify
$$P(C|\mathbf{x}) = \frac{P(C) \times \prod_{i=1}^{d} P(x_i|C)}{P(\mathbf{x})}$$

Since $P(\mathbf{x})$ is constant across classes, we can ignore it for classification.

↓ $P(\mathbf{x})$ is independent of C
Step 4: Final Classification Rule
$$P(C|\mathbf{x}) \propto P(C) \times \prod_{i=1}^{d} P(x_i|C)$$

We only need the proportional relationship for finding the maximum.

↓
Step 5: Prediction Rule
$$\hat{c} = \arg\max_{C} P(C|\mathbf{x}) = \arg\max_{C} \left[P(C) \times \prod_{i=1}^{d} P(x_i|C)\right]$$

Choose the class that maximizes the posterior probability.
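In practice, the product of many small probabilities underflows floating point, so implementations maximize the logarithm of the right-hand side instead. A minimal sketch of the prediction rule (class names and probability tables are illustrative):

```python
import math

# Illustrative model: priors P(C) and per-feature likelihoods P(x_i | C)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.3],  # P(x_1|spam), P(x_2|spam)
    "ham":  [0.1, 0.7],  # P(x_1|ham),  P(x_2|ham)
}

def predict(priors, likelihoods):
    # argmax over classes of log P(C) + sum_i log P(x_i | C)
    scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get)

print(predict(priors, likelihoods))  # spam
```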

🌸 Iris Dataset Analysis

Gaussian Naive Bayes on the famous Iris flower dataset

Overview

Iris Dataset: 150 samples, 4 features, 3 classes

  • Features: sepal_length, sepal_width, petal_length, petal_width
  • Classes: setosa, versicolor, virginica
  • Data Type: Continuous (suitable for Gaussian NB)
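Gaussian NB handles continuous features by fitting, for each class, a normal distribution per feature (a mean μ and variance σ²) and using its density as the likelihood $P(x_i|C)$. A sketch on a single feature; the per-class statistics below are illustrative, not computed from the real dataset:

```python
import math

def gaussian_pdf(x, mu, var):
    """Class-conditional likelihood P(x_i | C) under a normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative per-class petal-length statistics: (mean, variance)
stats = {"setosa": (1.5, 0.03), "versicolor": (4.3, 0.22)}

x = 1.4  # a petal length in cm
for cls, (mu, var) in stats.items():
    print(cls, round(gaussian_pdf(x, mu, var), 4))
```

A petal length of 1.4 cm is far more likely under the setosa distribution than the versicolor one, which is exactly the signal the classifier multiplies across features.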

Naive Bayes Network for Iris Dataset

Feature Mapping

| Variable | Column Name  | Description     | Unit |
|----------|--------------|-----------------|------|
| X₁       | sepal_length | Length of sepal | cm   |
| X₂       | sepal_width  | Width of sepal  | cm   |
| X₃       | petal_length | Length of petal | cm   |
| X₄       | petal_width  | Width of petal  | cm   |

Code Implementation

Complete implementation of Gaussian Naive Bayes on the Iris dataset. This code demonstrates the full pipeline from data loading to model evaluation.

Complete Implementation

Loading code from iris_naivebayes.py...
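While the page loads its full example from iris_naivebayes.py, a hedged sketch of what such a pipeline typically contains (not the file's actual code) looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the 150-sample, 4-feature Iris dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit Gaussian NB: per class, it estimates a mean and variance per feature
model = GaussianNB()
model.fit(X_train, y_train)

print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
print(classification_report(y_test, model.predict(X_test)))
```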

📰 BBC News Classification

Multinomial Naive Bayes for text classification

Overview

BBC News Dataset: 2,225 articles, 5 categories

  • Categories: business, entertainment, politics, sport, tech
  • Data Type: Text (suitable for Multinomial NB)
  • Preprocessing: TF-IDF vectorization, stop words removal
  • Feature Space: 10,000+ word dimensions

Naive Bayes Network for BBC News

Interactive Feature Explorer

Code Implementation

Complete implementation of Multinomial Naive Bayes on the BBC News dataset. This code demonstrates text preprocessing, TF-IDF vectorization, and classification.

Complete Implementation

Loading code from bbc_naivebayes.py...
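While the page loads its full example from bbc_naivebayes.py, the core pattern — TF-IDF features fed into MultinomialNB — can be sketched on toy data (the articles below are made up, not BBC text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for BBC articles (illustrative, not real data)
texts = [
    "shares fell as the market reacted to interest rates",
    "the striker scored twice in the final match",
    "parliament debated the new election bill",
    "the chip maker unveiled a faster processor",
]
labels = ["business", "sport", "politics", "tech"]

# TF-IDF vectorization with English stop-word removal, then Multinomial NB
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
model.fit(texts, labels)

print(model.predict(["the midfielder missed the match"]))
```

On the real dataset the same pipeline simply sees 1,557 training articles and a vocabulary of 10,000+ words instead of four sentences.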

🚀 Google Colab Notebook

Run this BBC News classification example in Google Colab with free GPU/TPU access

๐Ÿ“ Text BBC News Classification with Multinomial Naive Bayes

This notebook demonstrates text classification using Multinomial Naive Bayes on the BBC News dataset. We'll cover:

  • Text preprocessing and TF-IDF vectorization
  • Multinomial Naive Bayes model training
  • Model evaluation with confusion matrix
  • Feature importance analysis
  • Interactive visualizations with Plotly
๐Ÿ Code Dataset Loading
# Load BBC News dataset from GitHub Pages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import plotly.graph_objects as go
import plotly.express as px

# Load data
train_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/train.csv"
val_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/val.csv"
test_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/test.csv"

train_df = pd.read_csv(train_url)
val_df = pd.read_csv(val_url)
test_df = pd.read_csv(test_url)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
📊 Output Results

Training samples: 1557

Validation samples: 334

Test samples: 334

Model Accuracy: 98.2%

✨ Colab Features:
  • 🆓 Free GPU/TPU access
  • 📊 Interactive Plotly visualizations
  • 💾 Save and share your results
  • 🔄 Real-time collaboration
  • 📱 Mobile-friendly interface

๐Ÿ„ Mushroom Classification

Categorical Naive Bayes for mushroom safety classification

Overview

Mushroom Dataset: 8,124 samples, 22 categorical features

  • Features: cap-shape, cap-surface, cap-color, bruises, odor, etc.
  • Target: edible (e) or poisonous (p)
  • Data Type: Categorical (suitable for Categorical NB)
  • Application: Food safety classification
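For categorical features, the likelihoods are just smoothed category frequencies within each class: $P(x_i = v \mid C) = \frac{\text{count}(v, C) + \alpha}{\text{count}(C) + \alpha K}$, with α = 1 (Laplace smoothing) and K possible values. A toy estimate (the counts below are illustrative, not the real UCI figures):

```python
# Illustrative counts of odor values among poisonous training mushrooms
counts = {"foul": 2160, "none": 120, "pungent": 256}
alpha = 1.0        # Laplace smoothing
K = len(counts)    # number of odor categories
total = sum(counts.values())

# Smoothed estimates of P(odor = v | poisonous)
probs = {v: (c + alpha) / (total + alpha * K) for v, c in counts.items()}
print(probs)
```

Smoothing keeps any category that never co-occurred with a class from zeroing out the whole product in the classification rule.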

Naive Bayes Network for Mushroom Dataset

Feature Mapping

| Variable | Feature Name | Possible Values | Description |
|----------|--------------|-----------------|-------------|
| X₁ | cap-shape | bell, conical, flat, knobbed, sunken, convex | Shape of the mushroom cap |
| X₂ | cap-surface | fibrous, grooves, scaly, smooth | Surface texture of the cap |
| X₃ | cap-color | brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow | Color of the mushroom cap |
| X₄ | bruises | bruises, no | Whether the mushroom bruises |
| X₅ | odor | almond, anise, creosote, fishy, foul, musty, none, pungent, spicy | Odor of the mushroom |

๐Ÿ„ Interactive Mushroom Safety Checker

Select mushroom characteristics to predict if it's safe to eat. ⚠️ This is for educational purposes only - never rely on this for real mushroom identification!

Code Implementation

Complete implementation of Categorical Naive Bayes on the Mushroom dataset. This code demonstrates categorical encoding, CategoricalNB training, and safety prediction.

Complete Implementation

Loading code from mushroom_naivebayes.py...
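While the page loads its full example from mushroom_naivebayes.py, the core pattern — integer-code every categorical column, then CategoricalNB — can be sketched on toy rows (the data below is made up, not the UCI table):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.naive_bayes import CategoricalNB

# Toy stand-in for the mushroom table (illustrative, not the UCI data)
df = pd.DataFrame({
    "odor":      ["none", "foul", "none", "foul", "almond", "pungent"],
    "cap_shape": ["convex", "flat", "bell", "convex", "bell", "flat"],
    "class":     ["e", "p", "e", "p", "e", "p"],
})

# CategoricalNB expects each feature encoded as ordinal integers
enc = OrdinalEncoder()
X = enc.fit_transform(df[["odor", "cap_shape"]].to_numpy())
le = LabelEncoder()
y = le.fit_transform(df["class"])

model = CategoricalNB()
model.fit(X, y)

# Predict a new mushroom: foul odor, flat cap
x_new = enc.transform([["foul", "flat"]])
print(le.inverse_transform(model.predict(x_new)))  # ['p']
```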

🚀 Google Colab Notebook

Run this Mushroom classification example in Google Colab with free GPU/TPU access

๐Ÿ“ Text Mushroom Classification with Categorical Naive Bayes

This notebook demonstrates categorical classification using Categorical Naive Bayes on the Mushroom dataset. We'll cover:

  • UCI Mushroom dataset download and preprocessing
  • Categorical feature encoding with LabelEncoder
  • CategoricalNB model training and evaluation
  • Feature importance analysis for safety prediction
  • Interactive safety checker demo
๐Ÿ Code Dataset Loading
# Load UCI Mushroom dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import urllib.request
import os

# Download dataset from UCI ML Repository
def download_mushroom_dataset():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
    filename = "mushroom_dataset.csv"
    
    if not os.path.exists(filename):
        print("Downloading UCI Mushroom dataset...")
        urllib.request.urlretrieve(url, filename)
        print("Download completed!")
    else:
        print("Dataset already exists locally")
    
    return filename

# Load and preprocess data
filename = download_mushroom_dataset()
df = pd.read_csv(filename, header=None)

print(f"Dataset shape: {df.shape}")
print(f"Features: {df.shape[1] - 1}")
print(f"Samples: {df.shape[0]}")
📊 Output Results

Dataset shape: (8124, 23)

Features: 22

Samples: 8124

Model Accuracy: 94.0%

Safety Checker: Ready!

✨ Colab Features:
  • 🆓 Free GPU/TPU access
  • 🍄 Interactive safety checker
  • 📊 Feature importance analysis
  • 💾 Save and share your results
  • 🔄 Real-time collaboration
  • 📱 Mobile-friendly interface

🔄 Naive Bayes Variants

Different types of Naive Bayes for different data types

Variants Comparison

🌺 Gaussian Naive Bayes

For continuous data (e.g., measurements, Iris dataset)

Data Type: Continuous/Numerical
Example: Iris flower measurements
Distribution: Normal (Gaussian)
Code Example:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)
accuracy = gnb.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

📰 Multinomial Naive Bayes

For discrete counts (e.g., word frequencies, text classification)

Data Type: Discrete Counts
Example: BBC News text classification
Distribution: Multinomial
Code Example:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Text data
texts = ["This is a business article", "Sports news here"]
labels = ["business", "sports"]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict new text
new_text = ["Economic growth continues"]
X_new = vectorizer.transform(new_text)  # new_text is already a list of documents
prediction = mnb.predict(X_new)[0]
print(f"Prediction: {prediction}")

๐Ÿ„ Categorical Naive Bayes

For categorical data (e.g., mushroom features, survey responses)

Data Type: Categorical
Example: Mushroom safety classification
Distribution: Categorical
Code Example:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Categorical data
df = pd.DataFrame({
    'cap_shape': ['convex', 'flat', 'convex'],
    'odor': ['none', 'foul', 'none'],
    'class': ['edible', 'poisonous', 'edible']
})

# Encode each categorical column as integer codes
# (apply() refits the encoder on every column; use separate encoders
# if you later need to inverse-transform the features)
le = LabelEncoder()
X = df[['cap_shape', 'odor']].apply(le.fit_transform)
y = LabelEncoder().fit_transform(df['class'])

# Train Categorical Naive Bayes
cnb = CategoricalNB()
cnb.fit(X, y)

# Make predictions
predictions = cnb.predict(X)
print(f"Predictions: {predictions}")

🔢 Bernoulli Naive Bayes

For binary features (e.g., word presence, yes/no responses)

Data Type: Binary (0/1)
Example: Spam email detection
Distribution: Bernoulli
Code Example:
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Binary text features
texts = ["free money now", "meeting tomorrow"]
labels = ["spam", "ham"]

# Create binary features
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X, labels)

# Predict new text
new_text = ["urgent free offer"]
X_new = vectorizer.transform(new_text)  # new_text is already a list of documents
prediction = bnb.predict(X_new)[0]
print(f"Prediction: {prediction}")