📝 Text Input → 🔢 Feature Extraction (BoW or TF-IDF) → 🎯 Dim. Reduction (None, Chi², or PCA) → 🤖 Classifier (NB, LR, RF, DT, or KNN)

Results: 240 total pipelines · 99.4% best accuracy · 0.46s fastest training

🧮 Total Pipeline Combinations

Total = Extractors × Reducers × Classifiers
= 6 × 4 × 10 = 240 pipelines
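The grid can be enumerated directly; a minimal sketch (the config labels below are illustrative, not taken from the project's code):

```python
from itertools import product

# Illustrative labels for each stage; the real code builds estimator objects.
extractors = ["bow_10k_uni", "bow_5k_bi", "bow_10k_bi",
              "tfidf_10k_uni", "tfidf_5k_bi", "tfidf_10k_bi"]
reducers = ["none", "chi2_1000", "pca_90", "pca_95"]
classifiers = ["nb_a1.0", "lr_c1", "lr_c10", "rf_d10", "rf_d20",
               "dt_d10", "dt_d20", "knn_3", "knn_5", "knn_10"]

# Every (extractor, reducer, classifier) triple is one pipeline.
pipelines = list(product(extractors, reducers, classifiers))
print(len(pipelines))  # 240
```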

1️⃣ Feature Extractors (6 configs)

Bag of Words (BoW) - 3 configs
  • 10000 features, unigrams (1,1)
  • 5000 features, bigrams (1,2)
  • 10000 features, bigrams (1,2)
TF-IDF - 3 configs
  • 10000 features, unigrams (1,1)
  • 5000 features, bigrams (1,2)
  • 10000 features, bigrams (1,2)
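These six configurations map directly onto scikit-learn's vectorizers; a sketch, assuming the project uses `CountVectorizer` and `TfidfVectorizer` (the dict keys are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# The six extractor configs listed above.
extractors = {
    "bow_10000_1-1":   CountVectorizer(max_features=10000, ngram_range=(1, 1)),
    "bow_5000_1-2":    CountVectorizer(max_features=5000, ngram_range=(1, 2)),
    "bow_10000_1-2":   CountVectorizer(max_features=10000, ngram_range=(1, 2)),
    "tfidf_10000_1-1": TfidfVectorizer(max_features=10000, ngram_range=(1, 1)),
    "tfidf_5000_1-2":  TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    "tfidf_10000_1-2": TfidfVectorizer(max_features=10000, ngram_range=(1, 2)),
}
```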

2️⃣ Dimensionality Reducers (4 configs)

None - 1 config
  • Use all features (no reduction)
Chi² - 1 config
  • Select top 1000 features
PCA - 2 configs
  • 90% variance explained
  • 95% variance explained
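A sketch of the four reducer configurations in scikit-learn terms, assuming `SelectKBest` with `chi2` and `PCA` (`None` stands for the pass-through case):

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

# The four reducer configs; None means features pass through untouched.
reducers = {
    "none":      None,
    "chi2_1000": SelectKBest(chi2, k=1000),
    "pca_90":    PCA(n_components=0.90),  # keep 90% of the variance
    "pca_95":    PCA(n_components=0.95),  # keep 95% of the variance
}
```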

3️⃣ Classifiers (10 configs across 5 models)

Naive Bayes - 1 config
  • alpha=1.0
Logistic Regression - 2 configs
  • C=1.0, C=10.0
Random Forest - 2 configs
  • max_depth=10, 20
Decision Tree - 2 configs
  • max_depth=10, 20
K-Nearest Neighbors - 3 configs
  • n_neighbors=3, 5, 10
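The ten classifier configurations as scikit-learn estimators; a sketch with illustrative names:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Ten configs across the five model families.
classifiers = {
    "nb_alpha1":  MultinomialNB(alpha=1.0),
    "lr_c1":      LogisticRegression(C=1.0, max_iter=1000),
    "lr_c10":     LogisticRegression(C=10.0, max_iter=1000),
    "rf_depth10": RandomForestClassifier(max_depth=10),
    "rf_depth20": RandomForestClassifier(max_depth=20),
    "dt_depth10": DecisionTreeClassifier(max_depth=10),
    "dt_depth20": DecisionTreeClassifier(max_depth=20),
    "knn_3":      KNeighborsClassifier(n_neighbors=3),
    "knn_5":      KNeighborsClassifier(n_neighbors=5),
    "knn_10":     KNeighborsClassifier(n_neighbors=10),
}
```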

📝 Example Pipeline

TF-IDF (10000, unigrams) → PCA (90% variance) → Logistic Regression (C=1.0)
= 1 of 240 pipelines
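This example pipeline can be assembled with scikit-learn's `Pipeline`; a sketch on a made-up toy corpus. Note that `PCA` requires dense input, so the sparse TF-IDF matrix is densified first (an assumption about how the project handles this step):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the real dataset.
texts = ["great movie loved it", "fantastic film great acting",
         "wonderful plot loved the cast", "terrible movie hated it",
         "awful film bad acting", "horrible plot hated the cast"]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000, ngram_range=(1, 1))),
    # PCA needs a dense matrix, so densify the sparse TF-IDF output first.
    ("densify", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("pca", PCA(n_components=0.90)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
pipe.fit(texts, labels)
print(pipe.predict(["loved this great film"]))
```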

🔢 Feature Extraction

Bag of Words (BoW)

Counts word occurrences in documents. Simple but effective baseline.

TF-IDF

Weighs words by importance (frequency × uniqueness).
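The reweighting is easy to see with `TfidfVectorizer`: a word that appears in every document gets a lower IDF than a rare one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
tfidf = TfidfVectorizer()
tfidf.fit(docs)

vocab = tfidf.vocabulary_
# "the" occurs in both documents, "mat" in only one,
# so "mat" carries the higher IDF weight.
print(tfidf.idf_[vocab["mat"]], tfidf.idf_[vocab["the"]])
```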

🎯 Dimensionality Reduction

Chi² (Chi-Square)

Selects top K most relevant features for classification.
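A toy illustration with `SelectKBest` (k=2 here; the pipelines above use k=1000):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy count matrix: 4 documents x 3 features, 2 classes.
# Features 0 and 2 track the class; feature 1 is noise.
X = np.array([[5, 1, 0],
              [4, 0, 1],
              [0, 1, 5],
              [1, 0, 4]])
y = [0, 0, 1, 1]

selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)            # (4, 2)
print(selector.get_support()) # which features survived
```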

PCA (Principal Component Analysis)

Reduces dimensions while preserving variance (90% or 95%).
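With scikit-learn's `PCA`, passing a float as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction; a sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Float n_components = target fraction of variance to preserve.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], "components kept")
print(round(pca.explained_variance_ratio_.sum(), 3))
```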

🤖 Classifiers

Naive Bayes

Probabilistic classifier based on Bayes' theorem. Fast and works well with text data.
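A toy illustration with `MultinomialNB` on word counts (the data is made up for the example):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts for three words.
X = np.array([[3, 0, 1],
              [4, 1, 0],
              [0, 3, 4],
              [1, 4, 3]])
y = ["sports", "sports", "politics", "politics"]

nb = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing term
nb.fit(X, y)
print(nb.predict([[5, 0, 1]]))  # a doc dominated by the first word
```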

Logistic Regression

Linear model for classification. Simple, interpretable, and often the best baseline.

Random Forest

Ensemble of decision trees. Robust and handles non-linear patterns well.

Decision Tree

Tree-based classifier. Simple and interpretable, but can overfit.

K-Nearest Neighbors (KNN)

Instance-based learning. Classifies based on similarity to training examples.
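The five model families above can be compared with one loop; a sketch on synthetic stand-in features rather than the article's text dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the extracted text features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = abs(X)  # MultinomialNB requires non-negative features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "NB":  MultinomialNB(alpha=1.0),
    "LR":  LogisticRegression(C=1.0, max_iter=1000),
    "RF":  RandomForestClassifier(max_depth=10, random_state=0),
    "DT":  DecisionTreeClassifier(max_depth=10, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```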

💻 Full Implementation Code

Ready-to-run Python code with all 240 pipeline combinations. Includes dataset download, feature extraction, dimensionality reduction, classification, and evaluation.

Features:

  • ✅ Automatic dataset download from GitHub Pages
  • ✅ 6 feature extractors (BoW + TF-IDF with different configs)
  • ✅ 4 dimensionality reducers (None, Chi², PCA 90%, PCA 95%)
  • ✅ 10 classifier configurations (NB, LR, RF, DT, KNN)
  • ✅ Training & inference time measurements
  • ✅ Performance comparison table with best highlighting
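Per-pipeline timing can be captured with `time.perf_counter`; a minimal sketch (the estimator and data here are stand-ins):

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Wall-clock training time.
t0 = time.perf_counter()
clf.fit(X, y)
train_time = time.perf_counter() - t0

# Wall-clock inference time over the same rows.
t0 = time.perf_counter()
clf.predict(X)
infer_time = time.perf_counter() - t0

print(f"train: {train_time:.4f}s  inference: {infer_time:.4f}s")
```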
🚀 Run Pipeline Comparison in Google Colab

An interactive notebook to explore and compare all 240 pipelines.

🏆 Top 9 Performers Comparison

Top 3 by Accuracy, Training Speed, and Inference Speed

🎯 Confusion Matrices & Per-Class Metrics