Naive Bayes Tutorial
Interactive tutorial on probabilistic classification using Bayes' theorem with independence assumption
Bayes' Theorem
Understanding the fundamental formula behind Naive Bayes
Mathematical Foundation
$P(C|X)$ - Posterior
Probability of class C given sample X
$P(X|C)$ - Likelihood
Probability of observing X when class is C
$P(C)$ - Prior
Prior probability of class C
$P(X)$ - Evidence
Probability of the sample X (normalizing constant)
Interactive Bayes Calculator
Posterior $P(C|X) = 0.58$
Real-world Example: Medical Diagnosis
Scenario: A patient tests positive for a disease. What's the probability they actually have the disease?
- Prior $P(\text{Disease})$: 1% of population has this disease
- Likelihood $P(\text{Positive}|\text{Disease})$: 95% accuracy when disease is present
- Evidence $P(\text{Positive})$: 5.94% of all tests are positive
Applying Bayes' theorem: $P(\text{Disease}|\text{Positive}) = \frac{0.95 \times 0.01}{0.0594} \approx 0.16$. Only a 16% chance the patient actually has the disease!
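The posterior can be checked with a few lines of arithmetic, plugging in the three quantities from the scenario:

```python
# Quantities from the medical-diagnosis scenario
prior = 0.01        # P(Disease): 1% of the population
likelihood = 0.95   # P(Positive | Disease): test sensitivity
evidence = 0.0594   # P(Positive): fraction of all tests that are positive

# Bayes' theorem: P(Disease | Positive) = P(Positive | Disease) * P(Disease) / P(Positive)
posterior = likelihood * prior / evidence
print(f"P(Disease | Positive) = {posterior:.1%}")  # ~16%
```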
Naive Bayes Classification
Applying Bayes' theorem to classification with independence assumption
Naive Bayes Network
Note: To complete the model description, we need to estimate P(C) and P(Xᵢ|C) for all d features.
Independence Assumption
The "Naive" Assumption: Features are conditionally independent given the class.
This means: If we know the class, knowing one feature doesn't tell us anything about another feature.
Mathematical Formulation
Step 1: Apply Bayes' Theorem
For classification, we need to find the most probable class given the feature vector.
Step 2: Apply Naive Independence
Features are conditionally independent given the class, so joint probability becomes product of individual probabilities.
Step 3: Combine and Simplify
Since P(x) is constant for all classes, we can ignore it for classification.
Step 4: Final Classification Rule
We only need the proportional relationship for finding the maximum.
Step 5: Prediction Rule
Choose the class that maximizes the posterior probability.
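Written out, the five steps combine as follows (with feature vector $x = (x_1, \ldots, x_d)$ and predicted class $\hat{C}$):

```latex
% Step 1: Bayes' theorem
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
% Step 2: naive independence assumption
P(x \mid C) = \prod_{i=1}^{d} P(x_i \mid C)
% Steps 3-4: drop the class-independent evidence P(x)
P(C \mid x) \propto P(C) \prod_{i=1}^{d} P(x_i \mid C)
% Step 5: maximum a posteriori prediction
\hat{C} = \arg\max_{C}\; P(C) \prod_{i=1}^{d} P(x_i \mid C)
```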
Iris Dataset Analysis
Gaussian Naive Bayes on the famous Iris flower dataset
Overview
Iris Dataset: 150 samples, 4 features, 3 classes
- Features: sepal_length, sepal_width, petal_length, petal_width
- Classes: setosa, versicolor, virginica
- Data Type: Continuous (suitable for Gaussian NB)
Naive Bayes Network for Iris Dataset
Feature Mapping
| Variable | Column Name | Description | Unit |
|---|---|---|---|
| X₁ | sepal_length | Length of sepal | cm |
| X₂ | sepal_width | Width of sepal | cm |
| X₃ | petal_length | Length of petal | cm |
| X₄ | petal_width | Width of petal | cm |
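Gaussian NB models each $P(X_i \mid C)$ as a normal distribution, so fitting reduces to estimating a per-class mean and variance for each feature. A minimal sketch of that estimation step:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# For each class, the Gaussian parameters of every feature are just
# the sample mean and variance of that class's rows
for c in np.unique(y):
    Xc = X[y == c]
    print(f"class {c}: mean = {Xc.mean(axis=0).round(2)}, var = {Xc.var(axis=0).round(2)}")
```

These are the same per-class statistics that `GaussianNB` stores internally (as `theta_` and `var_` in recent scikit-learn versions) after `fit`.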
Code Implementation
Complete implementation of Gaussian Naive Bayes on Iris dataset. This code demonstrates the full pipeline from data loading to model evaluation.
Complete Implementation
See iris_naivebayes.py for the full source.
BBC News Classification
Multinomial Naive Bayes for text classification
Overview
BBC News Dataset: 2,225 articles, 5 categories
- Categories: business, entertainment, politics, sport, tech
- Data Type: Text (suitable for Multinomial NB)
- Preprocessing: TF-IDF vectorization, stop words removal
- Feature Space: 10,000+ word dimensions
Naive Bayes Network for BBC News
Interactive Feature Explorer
Code Implementation
Complete implementation of Multinomial Naive Bayes on BBC News dataset. This code demonstrates text preprocessing, TF-IDF vectorization, and classification.
Complete Implementation
See bbc_naivebayes.py for the full source.
Google Colab Notebook
Run this BBC News classification example in Google Colab with free GPU/TPU access
This notebook demonstrates text classification using Multinomial Naive Bayes on the BBC News dataset. We'll cover:
- Text preprocessing and TF-IDF vectorization
- Multinomial Naive Bayes model training
- Model evaluation with confusion matrix
- Feature importance analysis
- Interactive visualizations with Plotly
# Load BBC News dataset from GitHub Pages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import plotly.graph_objects as go
import plotly.express as px
# Load data
train_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/train.csv"
val_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/val.csv"
test_url = "https://ltsach.github.io/AILearningHub/datasets/bbcnews/data/test.csv"
train_df = pd.read_csv(train_url)
val_df = pd.read_csv(val_url)
test_df = pd.read_csv(test_url)
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
Training samples: 1557
Validation samples: 334
Test samples: 334
Model Accuracy: 98.2%
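The training and evaluation step that produces that accuracy can be sketched as follows. In the notebook, `train_df`/`test_df` come from the loading cell above, and the column names `text` and `category` are assumptions about the CSV schema; a tiny inline stand-in keeps the snippet runnable on its own:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Stand-in for the real train_df/test_df loaded from the CSV URLs
train_df = pd.DataFrame({
    "text": ["shares and markets rally", "the team won the match",
             "profits beat forecasts", "striker scores twice"],
    "category": ["business", "sport", "business", "sport"],
})
test_df = pd.DataFrame({"text": ["markets close higher"], "category": ["business"]})

# TF-IDF features, then Multinomial NB on the term weights
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])

model = MultinomialNB().fit(X_train, train_df["category"])
print(f"Test accuracy: {accuracy_score(test_df['category'], model.predict(X_test)):.1%}")
```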
Colab Features:
- Free GPU/TPU access
- Interactive Plotly visualizations
- Save and share your results
- Real-time collaboration
- Mobile-friendly interface
Mushroom Classification
Categorical Naive Bayes for mushroom safety classification
Overview
Mushroom Dataset: 8,124 samples, 22 categorical features
- Features: cap-shape, cap-surface, cap-color, bruises, odor, etc.
- Target: edible (e) or poisonous (p)
- Data Type: Categorical (suitable for Categorical NB)
- Application: Food safety classification
Naive Bayes Network for Mushroom Dataset
Feature Mapping
| Variable | Feature Name | Possible Values | Description |
|---|---|---|---|
| X₁ | cap-shape | bell, conical, flat, knobbed, sunken, convex | Shape of the mushroom cap |
| X₂ | cap-surface | fibrous, grooves, scaly, smooth | Surface texture of the cap |
| X₃ | cap-color | brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow | Color of the mushroom cap |
| X₄ | bruises | bruises, no | Whether the mushroom bruises |
| X₅ | odor | almond, anise, creosote, fishy, foul, musty, none, pungent, spicy | Odor of the mushroom |
Interactive Mushroom Safety Checker
Select mushroom characteristics to predict whether the mushroom is safe to eat. Warning: this is for educational purposes only. Never rely on this for real mushroom identification!
Code Implementation
Complete implementation of Categorical Naive Bayes on Mushroom dataset. This code demonstrates categorical encoding, CategoricalNB training, and safety prediction.
Complete Implementation
See mushroom_naivebayes.py for the full source.
Google Colab Notebook
Run this Mushroom classification example in Google Colab with free GPU/TPU access
This notebook demonstrates categorical classification using Categorical Naive Bayes on the Mushroom dataset. We'll cover:
- UCI Mushroom dataset download and preprocessing
- Categorical feature encoding with LabelEncoder
- CategoricalNB model training and evaluation
- Feature importance analysis for safety prediction
- Interactive safety checker demo
# Load UCI Mushroom dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import urllib.request
import os
# Download dataset from UCI ML Repository
def download_mushroom_dataset():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
    filename = "mushroom_dataset.csv"
    if not os.path.exists(filename):
        print("Downloading UCI Mushroom dataset...")
        urllib.request.urlretrieve(url, filename)
        print("Download completed!")
    else:
        print("Dataset already exists locally")
    return filename
# Load and preprocess data
filename = download_mushroom_dataset()
df = pd.read_csv(filename, header=None)
print(f"Dataset shape: {df.shape}")
print(f"Features: {df.shape[1] - 1}")
print(f"Samples: {df.shape[0]}")
Dataset shape: (8124, 23)
Features: 22
Samples: 8124
Model Accuracy: 94.0%
Safety Checker: Ready!
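The encoding and training step that produces that accuracy can be sketched as follows. In the notebook, `df` comes from the download cell (column 0 is the class label, the rest are categorical features); a small stand-in with the same layout keeps this snippet runnable on its own:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

# Stand-in for the real df: column 0 = class ('e'/'p'), columns 1+ = features
df = pd.DataFrame([
    ["e", "x", "n"], ["p", "f", "g"], ["e", "x", "n"], ["p", "f", "y"],
    ["e", "b", "w"], ["p", "f", "g"], ["e", "x", "w"], ["p", "k", "y"],
])

# One LabelEncoder per column, kept around so predictions can be decoded later
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}
encoded = df.apply(lambda col: encoders[col.name].transform(col))

X = encoded.iloc[:, 1:]  # integer-coded categorical features
y = encoded.iloc[:, 0]   # 0 = edible ('e'), 1 = poisonous ('p')

model = CategoricalNB().fit(X, y)
# Training accuracy only, for brevity; the notebook uses a proper train/test split
print(f"Accuracy: {accuracy_score(y, model.predict(X)):.1%}")
```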
Colab Features:
- Free GPU/TPU access
- Interactive safety checker
- Feature importance analysis
- Save and share your results
- Real-time collaboration
- Mobile-friendly interface
Naive Bayes Variants
Different types of Naive Bayes for different data types
Variants Comparison
Gaussian Naive Bayes
For continuous data (e.g., measurements, Iris dataset)
Code Example:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # fixed seed for reproducibility
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Make predictions
y_pred = gnb.predict(X_test)
accuracy = gnb.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
Multinomial Naive Bayes
For discrete counts (e.g., word frequencies, text classification)
Code Example:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Text data
texts = ["This is a business article", "Sports news here"]
labels = ["business", "sports"]
# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X, labels)
# Predict new text
new_text = ["Economic growth continues"]
X_new = vectorizer.transform(new_text)  # new_text is already a list
prediction = mnb.predict(X_new)[0]
print(f"Prediction: {prediction}")
Categorical Naive Bayes
For categorical data (e.g., mushroom features, survey responses)
Code Example:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Categorical data
df = pd.DataFrame({
'cap_shape': ['convex', 'flat', 'convex'],
'odor': ['none', 'foul', 'none'],
'class': ['edible', 'poisonous', 'edible']
})
# Encode categorical features to integer codes
# (the single encoder is re-fit per column; keep a separate encoder
# per column if you need to map predictions back to the original labels)
le = LabelEncoder()
X = df[['cap_shape', 'odor']].apply(le.fit_transform)
y = LabelEncoder().fit_transform(df['class'])
# Train Categorical Naive Bayes
cnb = CategoricalNB()
cnb.fit(X, y)
# Make predictions
predictions = cnb.predict(X)
print(f"Predictions: {predictions}")
Bernoulli Naive Bayes
For binary features (e.g., word presence, yes/no responses)
Code Example:
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
# Binary text features
texts = ["free money now", "meeting tomorrow"]
labels = ["spam", "ham"]
# Create binary features
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)
# Train Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X, labels)
# Predict new text
new_text = ["urgent free offer"]
X_new = vectorizer.transform(new_text)  # new_text is already a list
prediction = bnb.predict(X_new)[0]
print(f"Prediction: {prediction}")