State-of-the-art transformer models with multiple pooling strategies
Bidirectional Encoder Representations from Transformers - Base version with 12 transformer layers. Pre-trained on massive text corpora using masked language modeling (MLM) and next sentence prediction (NSP).
Architecture: 12 layers, 768 hidden size, 12 attention heads, 109.5M parameters
Distilled version of BERT - 40% smaller, 60% faster, retains 97% of BERT's performance. Uses knowledge distillation during pre-training to compress the model.
Architecture: 6 layers, 768 hidden size, 12 attention heads, 66.4M parameters
Ultra-compact BERT - 87% smaller, 7.5x faster inference. Uses two-stage knowledge distillation (general + task-specific) for extreme compression.
Architecture: 4 layers, 312 hidden size, 12 attention heads, 14.4M parameters
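The parameter counts listed above can be checked directly. A minimal sketch, assuming the usual Hugging Face Hub checkpoint names for the three models (substitute the identifiers used in the notebook if they differ):

```python
# Sketch: load the three checkpoints and compare parameter counts.
# Checkpoint names are assumptions (common Hugging Face Hub identifiers).
from transformers import AutoModel

checkpoints = {
    "BERT-base": "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "TinyBERT": "huawei-noah/TinyBERT_General_4L_312D",
}

for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```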
Uses the final hidden state of the special [CLS] token (first token). BERT is pre-trained to encode sentence-level information in this token.
Formula: output = last_hidden_state[:, 0, :] (shape: [batch_size, hidden_size])
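A minimal sketch of [CLS] pooling with the Hugging Face Transformers API; the checkpoint and example sentences are illustrative:

```python
# Sketch: [CLS] pooling -- take the first token's final hidden state.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["BBC reports record profits.", "New film premieres tonight."],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state: [batch_size, seq_len, hidden_size];
# the [CLS] token sits at position 0 of every sequence.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [batch_size, hidden_size]
print(cls_embedding.shape)  # torch.Size([2, 768])
```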
Takes the [CLS] token hidden state and passes it through an additional dense layer + tanh activation. Provides a refined sentence representation optimized during pre-training.
Formula: output = tanh(dense(last_hidden_state[:, 0, :]))
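A short sketch of the pooler output, assuming a BertModel-style encoder; it also checks that pooler_output matches dense + tanh applied to the [CLS] state. Note that DistilBERT does not expose a pooler output.

```python
# Sketch: pooler output = tanh(dense([CLS] hidden state)).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["Stocks rallied after the announcement."], return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# The model's pooler applies a dense layer + tanh to the [CLS] hidden state.
pooled = outputs.pooler_output                                    # [batch_size, hidden_size]
manual = torch.tanh(model.pooler.dense(outputs.last_hidden_state[:, 0, :]))
print(torch.allclose(pooled, manual, atol=1e-6))                  # True
```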
Averages hidden states across all tokens (excluding padding). Captures information from the entire sequence, not just the [CLS] token.
Formula: output = sum(last_hidden_state * attention_mask, dim=1) / sum(attention_mask, dim=1)
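A sketch of mask-aware mean pooling under the same assumed setup:

```python
# Sketch: mean pooling -- average hidden states over real (non-padding) tokens only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Short headline.", "A much longer article summary about sport results."],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)

hidden = outputs.last_hidden_state                     # [batch_size, seq_len, hidden_size]
mask = batch["attention_mask"].unsqueeze(-1).float()   # [batch_size, seq_len, 1]

# Sum the real-token states and divide by the number of real tokens per sequence.
mean_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(mean_embedding.shape)  # torch.Size([2, 768])
```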
Complete code to fine-tune BERT models (BERT-base, DistilBERT, TinyBERT) with different pooling strategies for BBC News classification.
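Before opening the notebook, here is a minimal sketch of the overall setup: an encoder wrapped with a pooling choice and a linear classification head. Class and argument names are illustrative rather than the notebook's actual API, and num_labels=5 assumes the standard five-class BBC News label set (business, entertainment, politics, sport, tech).

```python
# Sketch: encoder + pooling strategy + linear head for text classification.
import torch
import torch.nn as nn
from transformers import AutoModel

class PooledClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", pooling="mean", num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.pooling = pooling
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if self.pooling == "cls":
            pooled = outputs.last_hidden_state[:, 0, :]
        elif self.pooling == "pooler":
            pooled = outputs.pooler_output        # not available for DistilBERT
        else:  # "mean"
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.classifier(pooled)            # logits: [batch_size, num_labels]
```

Fine-tuning then proceeds as usual: feed the logits to a cross-entropy loss and update both the encoder and the head.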
Open this notebook in Google Colab to experiment with BERT fine-tuning for text classification. Includes all 8 configurations (3 models × 3 pooling strategies, minus the pooler output for DistilBERT, which does not provide one).