📝 Text input ("Business news...")
→ 🎯 Feature extractor (choose a tokenizer, encoder, and pooling strategy):
  🔤 Tokenizer: WordPiece
  → 🧠 BERT encoder: DistilBERT (66M) | BERT-base (110M) | TinyBERT (14M)
  → 🎚️ Pooling: [CLS] | Mean | Pooler (output: a sentence vector of the encoder's hidden size, 768 for BERT-base and DistilBERT, 312 for TinyBERT)
→ 🤖 Task head: classification, a linear layer mapping hidden size → 5 classes
→ ✨ Results: 90-98% accuracy
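The sketch below (a minimal illustration, not the notebook's exact code) wires these stages together for a single prediction, assuming the bert-base-uncased checkpoint, [CLS] pooling, an untrained linear head, and a hypothetical label order for the five BBC News classes.

import torch
from transformers import AutoTokenizer, AutoModel

LABELS = ["business", "entertainment", "politics", "sport", "tech"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece tokenizer
encoder = AutoModel.from_pretrained("bert-base-uncased")         # BERT encoder (~110M parameters)
head = torch.nn.Linear(encoder.config.hidden_size, len(LABELS))  # task head: 768 -> 5 classes (untrained here)

inputs = tokenizer("Business news...", return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # [batch, seq_len, 768]
    sentence_vec = hidden[:, 0, :]                 # [CLS] pooling -> [batch, 768]
    logits = head(sentence_vec)                    # [batch, 5]

print(LABELS[logits.argmax(dim=-1).item()])        # random until the head is fine-tuned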

🤖 BERT Model Variants

🧠

BERT-base (bert-base-uncased)

Bidirectional Encoder Representations from Transformers - Base version with 12 transformer layers. Pre-trained on massive text corpora using masked language modeling (MLM) and next sentence prediction (NSP).

Architecture: 12 layers, 768 hidden size, 12 attention heads, 109.5M parameters

⚡

DistilBERT (distilbert-base-uncased)

Distilled version of BERT - 40% smaller, 60% faster, retains 97% of BERT's performance. Uses knowledge distillation during pre-training to compress the model.

Architecture: 6 layers, 768 hidden size, 12 attention heads, 66.4M parameters

🚀

TinyBERT (huawei-noah/TinyBERT_General_4L_312D)

Ultra-compact BERT - 7.5x smaller (roughly 87% fewer parameters) and 9.4x faster at inference than BERT-base. Uses two-stage knowledge distillation (general + task-specific) for extreme compression.

Architecture: 4 layers, 312 hidden size, 12 attention heads, 14.4M parameters
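As a quick comparison, the following sketch (assuming the three Hugging Face checkpoints named above) loads each model with transformers and reports its parameter count and hidden size.

from transformers import AutoModel

CHECKPOINTS = [
    "bert-base-uncased",                     # 12 layers, hidden size 768
    "distilbert-base-uncased",               # 6 layers, hidden size 768
    "huawei-noah/TinyBERT_General_4L_312D",  # 4 layers, hidden size 312
]

for name in CHECKPOINTS:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters, hidden size {model.config.hidden_size}")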

🎯 Pooling Strategies

📌

CLS Token Pooling

Uses the final hidden state of the special [CLS] token (first token). BERT is pre-trained to encode sentence-level information in this token.

Formula: output = last_hidden_state[:, 0, :] (shape: [batch_size, hidden_size])

🔵

Pooler Output

Takes [CLS] token hidden state and passes through additional dense layer + tanh activation. Provides a refined sentence representation optimized during pre-training.

Formula: output = tanh(dense(last_hidden_state[:, 0, :]))

📊

Mean Pooling

Averages hidden states across all tokens (excluding padding). Captures information from entire sequence, not just [CLS] token.

Formula: output = sum(last_hidden_state * attention_mask, dim=1) / sum(attention_mask, dim=1), with the mask broadcast across the hidden dimension
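A minimal sketch of all three strategies on a bert-base-uncased encoder, with shapes following the formulas above. Note that DistilBertModel exposes no pooler_output, so pooler pooling is only available for encoders that keep BERT's pre-trained pooler layer.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["Business news...", "Sport news..."], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

hidden = outputs.last_hidden_state                        # [batch, seq_len, hidden_size]
mask = batch["attention_mask"].unsqueeze(-1).float()      # [batch, seq_len, 1]

cls_vec = hidden[:, 0, :]                                 # CLS token pooling
pooler_vec = outputs.pooler_output                        # dense + tanh on the CLS state (BERT only)
mean_vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling over non-padding tokens

print(cls_vec.shape, pooler_vec.shape, mean_vec.shape)    # each: [batch_size, hidden_size]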

๐Ÿ“ Python Implementation

Complete code to fine-tune BERT models (BERT-base, DistilBERT, TinyBERT) with different pooling strategies for BBC News classification.
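Before opening the full notebook, here is a condensed fine-tuning sketch under assumed settings: distilbert-base-uncased, [CLS] pooling, AdamW at learning rate 2e-5, and a toy two-example batch standing in for the tokenized BBC News data.

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

class BertClassifier(nn.Module):
    def __init__(self, checkpoint="distilbert-base-uncased", num_classes=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0, :])                  # CLS pooling -> class logits

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = BertClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Toy batch; in the notebook, batches come from the tokenized BBC News dataset.
texts = ["Shares rose sharply today.", "The striker scored twice."]
labels = torch.tensor([0, 3])                              # hypothetical label indices
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

model.train()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()                                            # one optimization step shown
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")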

โฌ‡๏ธ Download
โณ Loading code...
Open in Colab

🚀 Run in Google Colab

Open this notebook in Google Colab to experiment with BERT fine-tuning for text classification. Includes all 8 configurations (3 models × 3 pooling strategies, excluding the pooler option for DistilBERT, which has no pooler layer).

✓ Free GPU/TPU access
✓ Pre-installed libraries
✓ Interactive execution
✓ Modify & experiment

📚 What's included:

  • Dataset loading and preprocessing (BBC News)
  • Tokenization with BERT tokenizers
  • Model fine-tuning for 8 configurations
  • Training with PyTorch/Transformers
  • Evaluation metrics and confusion matrices (see the evaluation sketch after this list)
  • Pooling strategy comparisons
  • Model performance analysis
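
To illustrate the evaluation step, here is a sketch with a hypothetical evaluate helper (the batch layout and label order are assumptions) that computes accuracy, a confusion matrix, and a per-class report with scikit-learn.

import torch
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

LABELS = ["business", "entertainment", "politics", "sport", "tech"]  # assumed label order

@torch.no_grad()
def evaluate(model, dataloader, device="cpu"):
    model.eval()
    preds, targets = [], []
    for batch in dataloader:                               # batches of tokenized BBC News articles
        logits = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
        targets.extend(batch["labels"].tolist())
    print("accuracy:", accuracy_score(targets, preds))
    print(confusion_matrix(targets, preds))
    print(classification_report(targets, preds, target_names=LABELS))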