🎯 Naive Bayes Tutorial

An interactive tutorial on probabilistic classification using Bayes' theorem with the naive independence assumption

📊 Bayes' Theorem

Understanding the fundamental formula behind Naive Bayes

Mathematical Foundation

$$P(C|X) = \frac{P(X|C) \times P(C)}{P(X)}$$

$P(C|X)$ - Posterior

Probability of class C given sample X

$P(X|C)$ - Likelihood

Probability of observing X when class is C

$P(C)$ - Prior

Prior probability of class C

$P(X)$ - Evidence

Probability of the sample X (normalizing constant)
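The four quantities above combine in a single line of arithmetic. A minimal sketch (the function name and example values are illustrative):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

# Example: P(C) = 0.5, P(X|C) = 0.7, P(X) = 0.6
print(round(posterior(0.5, 0.7, 0.6), 2))  # 0.58
```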

Interactive Bayes Calculator

[Interactive widget: sliders set the prior, likelihood, and evidence; the posterior $P(C|X)$ updates live, e.g. showing 0.58 for the default settings.]

Real-world Example: Medical Diagnosis

Scenario: A patient tests positive for a disease. What's the probability they actually have the disease?

  • Prior $P(\text{Disease})$: 1% of population has this disease
  • Likelihood $P(\text{Positive}|\text{Disease})$: the test returns positive 95% of the time when the disease is present (sensitivity)
  • Evidence $P(\text{Positive})$: 5.94% of all tests are positive
$$P(\text{Disease}|\text{Positive}) = \frac{0.95 \times 0.01}{0.0594} = \mathbf{0.16}$$

Only 16% chance the patient actually has the disease!
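The calculation above can be reproduced directly from the three stated quantities:

```python
prior = 0.01        # P(Disease): 1% of the population has the disease
likelihood = 0.95   # P(Positive|Disease): test sensitivity
evidence = 0.0594   # P(Positive): fraction of all tests that come back positive

posterior = likelihood * prior / evidence
print(f"P(Disease|Positive) = {posterior:.2f}")  # 0.16
```

The low prior dominates: even a fairly accurate test yields mostly false positives when the disease is rare.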

🎯 Naive Bayes Classification

Applying Bayes' theorem to classification with independence assumption

Naive Bayes Network

Note: To complete the model description, we need to estimate P(C) and P(Xᵢ|C) for all d features.

Independence Assumption

$$P(X_1, X_2, \ldots, X_d | C) = \prod_{i=1}^{d} P(X_i | C)$$

The "Naive" Assumption: Features are conditionally independent given the class.

This means: once we know the class, observing one feature tells us nothing more about any other feature.
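A toy numerical check of what the assumption buys us: instead of estimating one d-dimensional joint distribution, we estimate d one-dimensional ones and multiply (the per-feature likelihoods below are illustrative):

```python
import math

# Illustrative per-feature likelihoods P(x_i | C) for one class
per_feature = [0.8, 0.5, 0.9]

# Under the naive assumption, the joint likelihood is just the product
joint = math.prod(per_feature)
print(round(joint, 2))  # 0.36
```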

Mathematical Formulation

Step 1: Apply Bayes' Theorem
$$P(C|\mathbf{x}) = \frac{P(C) \times P(\mathbf{x}|C)}{P(\mathbf{x})}$$

For classification, we need to find the most probable class given the feature vector.

↓ Apply Naive Independence Assumption
Step 2: Apply Naive Independence
$$P(\mathbf{x}|C) = \prod_{i=1}^{d} P(x_i|C)$$

Features are conditionally independent given the class, so joint probability becomes product of individual probabilities.

↓ Substitute into Bayes' theorem
Step 3: Combine and Simplify
$$P(C|\mathbf{x}) = \frac{P(C) \times \prod_{i=1}^{d} P(x_i|C)}{P(\mathbf{x})}$$

Since $P(\mathbf{x})$ is constant across classes, we can ignore it for classification.

↓ $P(\mathbf{x})$ is independent of C
Step 4: Final Classification Rule
$$P(C|\mathbf{x}) \propto P(C) \times \prod_{i=1}^{d} P(x_i|C)$$

We only need the proportional relationship for finding the maximum.

↓
Step 5: Prediction Rule
$$\hat{c} = \arg\max_{C} P(C|\mathbf{x}) = \arg\max_{C} \left[P(C) \times \prod_{i=1}^{d} P(x_i|C)\right]$$

Choose the class that maximizes the posterior probability.
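In practice, the product of many small probabilities underflows floating point, so implementations maximize the logarithm of the right-hand side instead. A minimal sketch of the prediction rule (class names and probability tables are illustrative):

```python
import math

# Illustrative model: priors P(C) and per-feature likelihoods P(x_i | C)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.3],  # P(x_1|spam), P(x_2|spam)
    "ham":  [0.1, 0.7],  # P(x_1|ham),  P(x_2|ham)
}

def predict(priors, likelihoods):
    # argmax over classes of log P(C) + sum_i log P(x_i | C)
    scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get)

print(predict(priors, likelihoods))  # spam
```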

🌸 Iris Dataset Analysis

Gaussian Naive Bayes on the famous Iris flower dataset

Overview

Iris Dataset: 150 samples, 4 features, 3 classes

  • Features: sepal_length, sepal_width, petal_length, petal_width
  • Classes: setosa, versicolor, virginica
  • Data Type: Continuous (suitable for Gaussian NB)
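Gaussian NB handles continuous features by fitting, for each class, a normal distribution per feature (a mean μ and variance σ²) and using its density as the likelihood $P(x_i|C)$. A sketch on a single feature; the per-class statistics below are illustrative, not computed from the real dataset:

```python
import math

def gaussian_pdf(x, mu, var):
    """Class-conditional likelihood P(x_i | C) under a normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative per-class petal-length statistics: (mean, variance)
stats = {"setosa": (1.5, 0.03), "versicolor": (4.3, 0.22)}

x = 1.4  # a petal length in cm
for cls, (mu, var) in stats.items():
    print(cls, round(gaussian_pdf(x, mu, var), 4))
```

A petal length of 1.4 cm is far more likely under the setosa distribution than the versicolor one, which is exactly the signal the classifier multiplies across features.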

Naive Bayes Network for Iris Dataset

Feature Mapping

| Variable | Column Name  | Description     | Unit |
|----------|--------------|-----------------|------|
| X₁       | sepal_length | Length of sepal | cm   |
| X₂       | sepal_width  | Width of sepal  | cm   |
| X₃       | petal_length | Length of petal | cm   |
| X₄       | petal_width  | Width of petal  | cm   |

Code Implementation

Complete implementation of Gaussian Naive Bayes on the Iris dataset. This code demonstrates the full pipeline from data loading to model evaluation.

Complete Implementation

Loading code from iris_naivebayes.py...
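While the page loads its full example from iris_naivebayes.py, a hedged sketch of what such a pipeline typically contains (not the file's actual code) looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the 150-sample, 4-feature Iris dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit Gaussian NB: per class, it estimates a mean and variance per feature
model = GaussianNB()
model.fit(X_train, y_train)

print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
print(classification_report(y_test, model.predict(X_test)))
```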

📰 BBC News Classification

Multinomial Naive Bayes for text classification

Overview

BBC News Dataset: 2,225 articles, 5 categories

  • Categories: business, entertainment, politics, sport, tech
  • Data Type: Text (suitable for Multinomial NB)
  • Preprocessing: TF-IDF vectorization, stop words removal
  • Feature Space: 10,000+ word dimensions

Naive Bayes Network for BBC News

Interactive Feature Explorer

Code Implementation

Complete implementation of Multinomial Naive Bayes on the BBC News dataset. This code demonstrates text preprocessing, TF-IDF vectorization, and classification.

Complete Implementation

Loading code from bbc_naivebayes.py...
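While the page loads its full example from bbc_naivebayes.py, the core pattern — TF-IDF features fed into MultinomialNB — can be sketched on toy data (the articles below are made up, not BBC text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for BBC articles (illustrative, not real data)
texts = [
    "shares fell as the market reacted to interest rates",
    "the striker scored twice in the final match",
    "parliament debated the new election bill",
    "the chip maker unveiled a faster processor",
]
labels = ["business", "sport", "politics", "tech"]

# TF-IDF vectorization with English stop-word removal, then Multinomial NB
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
model.fit(texts, labels)

print(model.predict(["the midfielder missed the match"]))
```

On the real dataset the same pipeline simply sees 1,557 training articles and a vocabulary of 10,000+ words instead of four sentences.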

🚀 Google Colab Notebook

Run this BBC News classification example in Google Colab with free GPU/TPU access

๐Ÿ“ Text BBC News Classification with Multinomial Naive Bayes

This notebook demonstrates text classification using Multinomial Naive Bayes on the BBC News dataset. We'll cover:

  • Text preprocessing and TF-IDF vectorization
  • Multinomial Naive Bayes model training
  • Model evaluation with confusion matrix
  • Feature importance analysis
  • Interactive visualizations with Plotly
๐Ÿ Code Dataset Loading
# Load BBC News dataset from GitHub Pages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import plotly.graph_objects as go
import plotly.express as px

# Load data
train_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/train.csv"
val_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/val.csv"
test_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/test.csv"

train_df = pd.read_csv(train_url)
val_df = pd.read_csv(val_url)
test_df = pd.read_csv(test_url)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
📊 Output Results

Training samples: 1557

Validation samples: 334

Test samples: 334

Model Accuracy: 98.2%

✨ Colab Features:
  • 🆓 Free GPU/TPU access
  • 📊 Interactive Plotly visualizations
  • 💾 Save and share your results
  • 🔄 Real-time collaboration
  • 📱 Mobile-friendly interface

๐Ÿ„ Mushroom Classification

Categorical Naive Bayes for mushroom safety classification

Overview

Mushroom Dataset: 8,124 samples, 22 categorical features

  • Features: cap-shape, cap-surface, cap-color, bruises, odor, etc.
  • Target: edible (e) or poisonous (p)
  • Data Type: Categorical (suitable for Categorical NB)
  • Application: Food safety classification
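For categorical features, the likelihoods are just smoothed category frequencies within each class: $P(x_i = v \mid C) = \frac{\text{count}(v, C) + \alpha}{\text{count}(C) + \alpha K}$, with α = 1 (Laplace smoothing) and K possible values. A toy estimate (the counts below are illustrative, not the real UCI figures):

```python
# Illustrative counts of odor values among poisonous training mushrooms
counts = {"foul": 2160, "none": 120, "pungent": 256}
alpha = 1.0        # Laplace smoothing
K = len(counts)    # number of odor categories
total = sum(counts.values())

# Smoothed estimates of P(odor = v | poisonous)
probs = {v: (c + alpha) / (total + alpha * K) for v, c in counts.items()}
print(probs)
```

Smoothing keeps any category that never co-occurred with a class from zeroing out the whole product in the classification rule.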

Naive Bayes Network for Mushroom Dataset

Feature Mapping

| Variable | Feature Name | Possible Values | Description |
|----------|--------------|-----------------|-------------|
| X₁ | cap-shape | bell, conical, flat, knobbed, sunken, convex | Shape of the mushroom cap |
| X₂ | cap-surface | fibrous, grooves, scaly, smooth | Surface texture of the cap |
| X₃ | cap-color | brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow | Color of the mushroom cap |
| X₄ | bruises | bruises, no | Whether the mushroom bruises |
| X₅ | odor | almond, anise, creosote, fishy, foul, musty, none, pungent, spicy | Odor of the mushroom |

๐Ÿ„ Interactive Mushroom Safety Checker

Select mushroom characteristics to predict if it's safe to eat. ⚠️ This is for educational purposes only - never rely on this for real mushroom identification!

Code Implementation

Complete implementation of Categorical Naive Bayes on the Mushroom dataset. This code demonstrates categorical encoding, CategoricalNB training, and safety prediction.

Complete Implementation

Loading code from mushroom_naivebayes.py...
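While the page loads its full example from mushroom_naivebayes.py, the core pattern — integer-code every categorical column, then CategoricalNB — can be sketched on toy rows (the data below is made up, not the UCI table):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.naive_bayes import CategoricalNB

# Toy stand-in for the mushroom table (illustrative, not the UCI data)
df = pd.DataFrame({
    "odor":      ["none", "foul", "none", "foul", "almond", "pungent"],
    "cap_shape": ["convex", "flat", "bell", "convex", "bell", "flat"],
    "class":     ["e", "p", "e", "p", "e", "p"],
})

# CategoricalNB expects each feature encoded as ordinal integers
enc = OrdinalEncoder()
X = enc.fit_transform(df[["odor", "cap_shape"]].to_numpy())
le = LabelEncoder()
y = le.fit_transform(df["class"])

model = CategoricalNB()
model.fit(X, y)

# Predict a new mushroom: foul odor, flat cap
x_new = enc.transform([["foul", "flat"]])
print(le.inverse_transform(model.predict(x_new)))  # ['p']
```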

🚀 Google Colab Notebook

Run this Mushroom classification example in Google Colab with free GPU/TPU access

๐Ÿ“ Text Mushroom Classification with Categorical Naive Bayes

This notebook demonstrates categorical classification using Categorical Naive Bayes on the Mushroom dataset. We'll cover:

  • UCI Mushroom dataset download and preprocessing
  • Categorical feature encoding with LabelEncoder
  • CategoricalNB model training and evaluation
  • Feature importance analysis for safety prediction
  • Interactive safety checker demo
๐Ÿ Code Dataset Loading
# Load UCI Mushroom dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import urllib.request
import os

# Download dataset from UCI ML Repository
def download_mushroom_dataset():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
    filename = "mushroom_dataset.csv"
    
    if not os.path.exists(filename):
        print("Downloading UCI Mushroom dataset...")
        urllib.request.urlretrieve(url, filename)
        print("Download completed!")
    else:
        print("Dataset already exists locally")
    
    return filename

# Load and preprocess data
filename = download_mushroom_dataset()
df = pd.read_csv(filename, header=None)

print(f"Dataset shape: {df.shape}")
print(f"Features: {df.shape[1] - 1}")
print(f"Samples: {df.shape[0]}")
📊 Output Results

Dataset shape: (8124, 23)

Features: 22

Samples: 8124

Model Accuracy: 94.0%

Safety Checker: Ready!

✨ Colab Features:
  • 🆓 Free GPU/TPU access
  • 🍄 Interactive safety checker
  • 📊 Feature importance analysis
  • 💾 Save and share your results
  • 🔄 Real-time collaboration
  • 📱 Mobile-friendly interface

🔄 Naive Bayes Variants

Different types of Naive Bayes for different data types

Variants Comparison

🌺 Gaussian Naive Bayes

For continuous data (e.g., measurements, Iris dataset)

Data Type: Continuous/Numerical
Example: Iris flower measurements
Distribution: Normal (Gaussian)
Code Example:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)
accuracy = gnb.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

📰 Multinomial Naive Bayes

For discrete counts (e.g., word frequencies, text classification)

Data Type: Discrete Counts
Example: BBC News text classification
Distribution: Multinomial
Code Example:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Text data
texts = ["This is a business article", "Sports news here"]
labels = ["business", "sports"]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X, labels)

# Predict new text
new_text = ["Economic growth continues"]
X_new = vectorizer.transform(new_text)  # new_text is already a list of documents
prediction = mnb.predict(X_new)[0]
print(f"Prediction: {prediction}")

๐Ÿ„ Categorical Naive Bayes

For categorical data (e.g., mushroom features, survey responses)

Data Type: Categorical
Example: Mushroom safety classification
Distribution: Categorical
Code Example:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Categorical data
df = pd.DataFrame({
    'cap_shape': ['convex', 'flat', 'convex'],
    'odor': ['none', 'foul', 'none'],
    'class': ['edible', 'poisonous', 'edible']
})

# Encode each categorical column as integer codes
# (apply() refits the encoder on every column; use separate encoders
# if you later need to inverse-transform the features)
le = LabelEncoder()
X = df[['cap_shape', 'odor']].apply(le.fit_transform)
y = LabelEncoder().fit_transform(df['class'])

# Train Categorical Naive Bayes
cnb = CategoricalNB()
cnb.fit(X, y)

# Make predictions
predictions = cnb.predict(X)
print(f"Predictions: {predictions}")

🔢 Bernoulli Naive Bayes

For binary features (e.g., word presence, yes/no responses)

Data Type: Binary (0/1)
Example: Spam email detection
Distribution: Bernoulli
Code Example:
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Binary text features
texts = ["free money now", "meeting tomorrow"]
labels = ["spam", "ham"]

# Create binary features
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X, labels)

# Predict new text
new_text = ["urgent free offer"]
X_new = vectorizer.transform(new_text)  # new_text is already a list of documents
prediction = bnb.predict(X_new)[0]
print(f"Prediction: {prediction}")