📝 Text Input → 🔢 Feature Extraction (BoW or TF-IDF) → 🎯 Dim. Reduction (None, Chi², or PCA) → 🤖 Classifier (NB, LR, RF, DT, or KNN)

Results: 240 total pipelines · 99.4% best accuracy · 0.46s fastest training

🧮 Total Pipeline Combinations

Total = Extractors × Reducers × Classifiers
= 6 × 4 × 10 = 240 pipelines
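The grid can be enumerated directly; a minimal sketch (the config labels below are illustrative, not taken from the project's code):

```python
from itertools import product

# Illustrative labels for each stage; the real code builds estimator objects.
extractors = ["bow_10k_uni", "bow_5k_bi", "bow_10k_bi",
              "tfidf_10k_uni", "tfidf_5k_bi", "tfidf_10k_bi"]
reducers = ["none", "chi2_1000", "pca_90", "pca_95"]
classifiers = ["nb_a1.0", "lr_c1", "lr_c10", "rf_d10", "rf_d20",
               "dt_d10", "dt_d20", "knn_3", "knn_5", "knn_10"]

# Every (extractor, reducer, classifier) triple is one pipeline.
pipelines = list(product(extractors, reducers, classifiers))
print(len(pipelines))  # 240
```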

1️⃣ Feature Extractors (6 configs)

Bag of Words (BoW) - 3 configs
  • 10000 features, unigrams (1,1)
  • 5000 features, bigrams (1,2)
  • 10000 features, bigrams (1,2)
TF-IDF - 3 configs
  • 10000 features, unigrams (1,1)
  • 5000 features, bigrams (1,2)
  • 10000 features, bigrams (1,2)
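These six configurations map directly onto scikit-learn's vectorizers; a sketch, assuming the project uses `CountVectorizer` and `TfidfVectorizer` (the dict keys are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# The six extractor configs listed above.
extractors = {
    "bow_10000_1-1":   CountVectorizer(max_features=10000, ngram_range=(1, 1)),
    "bow_5000_1-2":    CountVectorizer(max_features=5000, ngram_range=(1, 2)),
    "bow_10000_1-2":   CountVectorizer(max_features=10000, ngram_range=(1, 2)),
    "tfidf_10000_1-1": TfidfVectorizer(max_features=10000, ngram_range=(1, 1)),
    "tfidf_5000_1-2":  TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    "tfidf_10000_1-2": TfidfVectorizer(max_features=10000, ngram_range=(1, 2)),
}
```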

2️⃣ Dimensionality Reducers (4 configs)

None - 1 config
  • Use all features (no reduction)
Chi² - 1 config
  • Select top 1000 features
PCA - 2 configs
  • 90% variance explained
  • 95% variance explained
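A sketch of the four reducer configurations in scikit-learn terms, assuming `SelectKBest` with `chi2` and `PCA` (`None` stands for the pass-through case):

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

# The four reducer configs; None means features pass through untouched.
reducers = {
    "none":      None,
    "chi2_1000": SelectKBest(chi2, k=1000),
    "pca_90":    PCA(n_components=0.90),  # keep 90% of the variance
    "pca_95":    PCA(n_components=0.95),  # keep 95% of the variance
}
```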

3️⃣ Classifiers (10 configs across 5 models)

Naive Bayes - 1 config
  • alpha=1.0
Logistic Regression - 2 configs
  • C=1.0, C=10.0
Random Forest - 2 configs
  • max_depth=10, 20
Decision Tree - 2 configs
  • max_depth=10, 20
K-Nearest Neighbors - 3 configs
  • n_neighbors=3, 5, 10
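The ten classifier configurations as scikit-learn estimators; a sketch with illustrative names:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Ten configs across the five model families.
classifiers = {
    "nb_alpha1":  MultinomialNB(alpha=1.0),
    "lr_c1":      LogisticRegression(C=1.0, max_iter=1000),
    "lr_c10":     LogisticRegression(C=10.0, max_iter=1000),
    "rf_depth10": RandomForestClassifier(max_depth=10),
    "rf_depth20": RandomForestClassifier(max_depth=20),
    "dt_depth10": DecisionTreeClassifier(max_depth=10),
    "dt_depth20": DecisionTreeClassifier(max_depth=20),
    "knn_3":      KNeighborsClassifier(n_neighbors=3),
    "knn_5":      KNeighborsClassifier(n_neighbors=5),
    "knn_10":     KNeighborsClassifier(n_neighbors=10),
}
```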

📝 Example Pipeline

TF-IDF (10000, unigrams) → PCA (90% variance) → Logistic Regression (C=1.0)
= 1 of 240 pipelines
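This example pipeline can be assembled with scikit-learn's `Pipeline`; a sketch on a made-up toy corpus. Note that `PCA` requires dense input, so the sparse TF-IDF matrix is densified first (an assumption about how the project handles this step):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the real dataset.
texts = ["great movie loved it", "fantastic film great acting",
         "wonderful plot loved the cast", "terrible movie hated it",
         "awful film bad acting", "horrible plot hated the cast"]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000, ngram_range=(1, 1))),
    # PCA needs a dense matrix, so densify the sparse TF-IDF output first.
    ("densify", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("pca", PCA(n_components=0.90)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
pipe.fit(texts, labels)
print(pipe.predict(["loved this great film"]))
```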

🔢 Feature Extraction

Bag of Words (BoW)

Counts word occurrences in documents. Simple but effective baseline.

TF-IDF

Weighs words by importance (frequency × uniqueness).
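The reweighting is easy to see with `TfidfVectorizer`: a word that appears in every document gets a lower IDF than a rare one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
tfidf = TfidfVectorizer()
tfidf.fit(docs)

vocab = tfidf.vocabulary_
# "the" occurs in both documents, "mat" in only one,
# so "mat" carries the higher IDF weight.
print(tfidf.idf_[vocab["mat"]], tfidf.idf_[vocab["the"]])
```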

🎯 Dimensionality Reduction

Chi² (Chi-Square)

Selects top K most relevant features for classification.
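A toy illustration with `SelectKBest` (k=2 here; the pipelines above use k=1000):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy count matrix: 4 documents x 3 features, 2 classes.
# Features 0 and 2 track the class; feature 1 is noise.
X = np.array([[5, 1, 0],
              [4, 0, 1],
              [0, 1, 5],
              [1, 0, 4]])
y = [0, 0, 1, 1]

selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)            # (4, 2)
print(selector.get_support()) # which features survived
```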

PCA (Principal Component Analysis)

Reduces dimensions while preserving variance (90% or 95%).
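With scikit-learn's `PCA`, passing a float as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction; a sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Float n_components = target fraction of variance to preserve.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], "components kept")
print(round(pca.explained_variance_ratio_.sum(), 3))
```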

🤖 Classifiers

Naive Bayes

Probabilistic classifier based on Bayes' theorem. Fast and works well with text data.
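A toy illustration with `MultinomialNB` on word counts (the data is made up for the example):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts for three words.
X = np.array([[3, 0, 1],
              [4, 1, 0],
              [0, 3, 4],
              [1, 4, 3]])
y = ["sports", "sports", "politics", "politics"]

nb = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing term
nb.fit(X, y)
print(nb.predict([[5, 0, 1]]))  # a doc dominated by the first word
```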

Logistic Regression

Linear model for classification. Simple, interpretable, and often the best baseline.

Random Forest

Ensemble of decision trees. Robust and handles non-linear patterns well.

Decision Tree

Tree-based classifier. Simple and interpretable, but can overfit.

K-Nearest Neighbors (KNN)

Instance-based learning. Classifies based on similarity to training examples.
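The five model families above can be compared with one loop; a sketch on synthetic stand-in features rather than the article's text dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the extracted text features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = abs(X)  # MultinomialNB requires non-negative features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "NB":  MultinomialNB(alpha=1.0),
    "LR":  LogisticRegression(C=1.0, max_iter=1000),
    "RF":  RandomForestClassifier(max_depth=10, random_state=0),
    "DT":  DecisionTreeClassifier(max_depth=10, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```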

💻 Full Implementation Code

Ready-to-run Python code with all 240 pipeline combinations. Includes dataset download, feature extraction, dimensionality reduction, classification, and evaluation.

Features:

  • ✅ Automatic dataset download from GitHub Pages
  • ✅ 6 feature extractors (BoW + TF-IDF with different configs)
  • ✅ 4 dimensionality reducers (None, Chi², PCA 90%, PCA 95%)
  • ✅ 10 classifier configurations (NB, LR, RF, DT, KNN)
  • ✅ Training & inference time measurements
  • ✅ Performance comparison table with best highlighting
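Per-pipeline timing can be captured with `time.perf_counter`; a minimal sketch (the estimator and data here are stand-ins):

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Wall-clock training time.
t0 = time.perf_counter()
clf.fit(X, y)
train_time = time.perf_counter() - t0

# Wall-clock inference time over the same rows.
t0 = time.perf_counter()
clf.predict(X)
infer_time = time.perf_counter() - t0

print(f"train: {train_time:.4f}s  inference: {infer_time:.4f}s")
```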
🚀 Run Pipeline Comparison in Google Colab

An interactive notebook to explore and compare all 240 pipelines.

🏆 Top 9 Performers Comparison

Top 3 by Accuracy, Training Speed, and Inference Speed

🎯 Confusion Matrices & Per-Class Metrics