State-of-the-art transformer models with multiple pooling strategies
Bidirectional Encoder Representations from Transformers - Base version with 12 transformer layers. Pre-trained on massive text corpora using masked language modeling (MLM) and next sentence prediction (NSP).
Architecture: 12 layers, 768 hidden size, 12 attention heads, 109.5M parameters
Distilled version of BERT - 40% smaller, 60% faster, retains 97% of BERT's performance. Uses knowledge distillation during pre-training to compress the model.
Architecture: 6 layers, 768 hidden size, 12 attention heads, 66.4M parameters
Ultra-compact BERT - 87% smaller, 7.5x faster inference. Uses two-stage knowledge distillation (general + task-specific) for extreme compression.
Architecture: 4 layers, 312 hidden size, 12 attention heads, 14.4M parameters
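The parameter counts listed above can be checked directly. A minimal sketch, assuming the usual Hugging Face Hub checkpoint names for the three models (substitute the identifiers used in the notebook if they differ):

```python
# Sketch: load the three checkpoints and compare parameter counts.
# Checkpoint names are assumptions (common Hugging Face Hub identifiers).
from transformers import AutoModel

checkpoints = {
    "BERT-base": "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "TinyBERT": "huawei-noah/TinyBERT_General_4L_312D",
}

for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```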
Uses the final hidden state of the special [CLS] token (first token). BERT is pre-trained to encode sentence-level information in this token.
Formula: output = last_hidden_state[:, 0, :] (shape: [batch_size, hidden_size])
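A minimal sketch of [CLS] pooling with the Hugging Face Transformers API; the checkpoint and example sentences are illustrative:

```python
# Sketch: [CLS] pooling -- take the first token's final hidden state.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["BBC reports record profits.", "New film premieres tonight."],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state: [batch_size, seq_len, hidden_size];
# the [CLS] token sits at position 0 of every sequence.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch_size, hidden_size]
print(cls_embedding.shape)  # torch.Size([2, 768])
```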
Takes the [CLS] token hidden state and passes it through an additional dense layer + tanh activation. Provides a refined sentence representation optimized during pre-training.
Formula: output = tanh(dense(last_hidden_state[:, 0, :]))
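A short sketch of the pooler output, assuming a BertModel-style encoder; it also checks that pooler_output matches dense + tanh applied to the [CLS] state. Note that DistilBERT does not expose a pooler output.

```python
# Sketch: pooler output = tanh(dense([CLS] hidden state)).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["Stocks rallied after the announcement."], return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# The model's pooler applies a dense layer + tanh to the [CLS] hidden state.
pooled = outputs.pooler_output                                    # [batch_size, hidden_size]
manual = torch.tanh(model.pooler.dense(outputs.last_hidden_state[:, 0, :]))
print(torch.allclose(pooled, manual, atol=1e-6))                  # True
```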
Averages hidden states across all tokens (excluding padding). Captures information from the entire sequence, not just the [CLS] token.
Formula: output = sum(last_hidden_state * attention_mask, dim=1) / sum(attention_mask, dim=1)
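A sketch of mask-aware mean pooling under the same assumed setup:

```python
# Sketch: mean pooling -- average hidden states over real (non-padding) tokens only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Short headline.", "A much longer article summary about sport results."],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)

hidden = outputs.last_hidden_state                     # [batch_size, seq_len, hidden_size]
mask = batch["attention_mask"].unsqueeze(-1).float()   # [batch_size, seq_len, 1]

# Sum the real-token states and divide by the number of real tokens per sequence.
mean_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(mean_embedding.shape)  # torch.Size([2, 768])
```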
Complete code to fine-tune BERT models (BERT-base, DistilBERT, TinyBERT) with different pooling strategies for BBC News classification.
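Before opening the notebook, here is a minimal sketch of the overall setup: an encoder wrapped with a pooling choice and a linear classification head. Class and argument names are illustrative rather than the notebook's actual API, and num_labels=5 assumes the standard five-class BBC News label set (business, entertainment, politics, sport, tech).

```python
# Sketch: encoder + pooling strategy + linear head for text classification.
import torch
import torch.nn as nn
from transformers import AutoModel

class PooledClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", pooling="mean", num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.pooling = pooling
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if self.pooling == "cls":
            pooled = outputs.last_hidden_state[:, 0, :]
        elif self.pooling == "pooler":
            pooled = outputs.pooler_output        # not available for DistilBERT
        else:  # "mean"
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.classifier(pooled)            # logits: [batch_size, num_labels]
```

Fine-tuning then proceeds as usual: feed the logits to a cross-entropy loss and update both the encoder and the head.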
Open this notebook in Google Colab to experiment with BERT fine-tuning for text classification. Includes all 8 configurations (3 models × 3 pooling strategies, minus the pooler output for DistilBERT, which does not provide one).