🔄 Pipeline

Computer Vision classification pipeline with CNN architectures

🖼️ Image Input — 224×224 RGB images
🔧 Preprocessing (fixed) — Normalize, Resize
🧠 CNN Backbone (interchangeable) — ResNet50, EfficientNet, MobileNet...
🎯 Classifier Head (adapted) — 1000 → 37 classes
Classification Results — 37 breed predictions

📸 Input Image Pipeline

Technical aspects of preparing input data for CNN models: tensor formats, data loading, and preprocessing.

📐 Input Tensor Format

CNN models require 4D tensors with specific dimension ordering for efficient computation.

Shape: (N, C, H, W)
  • N — Batch size: number of images per batch (e.g., 32, 64, 128)
  • C — Channels: color channels (3 for RGB, 1 for grayscale)
  • H — Height: image height in pixels (e.g., 224, 256, 512)
  • W — Width: image width in pixels (e.g., 224, 256, 512)
Architecture-Specific Dimensions:
VGG16
input_shape = (N, 3, 224, 224)
memory_per_image ≈ 0.57MB (float32)
ResNet50
input_shape = (N, 3, 224, 224)
memory_per_image ≈ 0.57MB (float32)
EfficientNet-B3
input_shape = (N, 3, 300, 300)
memory_per_image ≈ 1.03MB (float32)
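Per-image memory follows directly from the tensor shape. A minimal sketch, assuming float32 inputs (4 bytes per value); `input_memory_mb` is an illustrative helper, not a library function:

```python
def input_memory_mb(channels, height, width, dtype_bytes=4):
    """MiB needed to store one image tensor (float32 by default)."""
    return channels * height * width * dtype_bytes / 2**20

print(round(input_memory_mb(3, 224, 224), 2))  # 0.57 (VGG16 / ResNet50 input)
print(round(input_memory_mb(3, 300, 300), 2))  # 1.03 (EfficientNet-B3 input)
```

Multiply by the batch size N for the full input tensor (e.g., ~18.4 MiB for a batch of 32 at 224×224).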

🔄 Channel Ordering: PyTorch vs TensorFlow

Understanding the difference between channel-first and channel-last memory layouts.

🐍 PyTorch (Channel-First)
# PyTorch tensor format
tensor.shape = (N, C, H, W)
# Example: (32, 3, 224, 224)

# Memory layout: [batch, channels, height, width]
# More efficient for GPU operations
  • Default for PyTorch models
  • More efficient for GPU operations
  • Native format for torchvision
🔧 TensorFlow/Keras (Channel-Last)
# TensorFlow tensor format
tensor.shape = (N, H, W, C)
# Example: (32, 224, 224, 3)

# Memory layout: [batch, height, width, channels]
# More intuitive for image processing
  • Default for TensorFlow models
  • More intuitive for image processing
  • Standard in computer vision
⚠️ Conversion Between Layouts:
# Channel-first → channel-last (the result is still a PyTorch tensor)
tensor_nhwc = tensor_nchw.permute(0, 2, 3, 1)

# Channel-last → channel-first
tensor_nchw = tensor_nhwc.permute(0, 3, 1, 2)

# Handing data to TensorFlow requires going through NumPy
array_nhwc = tensor_nchw.permute(0, 2, 3, 1).numpy()

# Using torchvision.transforms
from torchvision.transforms import ToTensor
tensor = ToTensor()(pil_image)  # PIL → PyTorch tensor (C, H, W)

📚 PyTorch Dataset & DataLoader

Implementing efficient data loading with PyTorch's Dataset and DataLoader classes.

🎯 Custom Dataset Class
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os

class OxfordPetsDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.transform = transform
        # Map breed names to integer labels so batches collate cleanly
        # and labels can be fed to CrossEntropyLoss
        self.classes = sorted(os.listdir(image_dir))
        self.class_to_idx = {breed: idx for idx, breed in enumerate(self.classes)}
        self.samples = self._load_samples()
    
    def _load_samples(self):
        """Load (image_path, label_index) pairs"""
        samples = []
        for breed in self.classes:
            breed_dir = os.path.join(self.image_dir, breed)
            for img_name in os.listdir(breed_dir):
                img_path = os.path.join(breed_dir, img_name)
                samples.append((img_path, self.class_to_idx[breed]))
        return samples
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        image_path, label = self.samples[idx]
        image = Image.open(image_path).convert('RGB')
        
        if self.transform:
            image = self.transform(image)
            
        return image, label
💡 Key Principles:
  • Lazy Loading: Images loaded only when needed
  • Transform Pipeline: Preprocessing applied on-the-fly
  • Memory Efficient: No pre-loading of entire dataset
🔄 DataLoader Configuration
# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,           # Number of samples per batch
    shuffle=True,            # Randomize order each epoch
    num_workers=4,           # Parallel loading processes
    pin_memory=True,         # GPU transfer optimization
    drop_last=True           # Drop incomplete batches
)

# Usage in training loop
for batch_idx, (images, labels) in enumerate(dataloader):
    # images: (batch_size, 3, 224, 224)
    # labels: (batch_size,)
    outputs = model(images)
    loss = criterion(outputs, labels)
batch_size — Number of samples per batch; affects memory usage and training stability.
Example: batch_size=32 (common choice)

shuffle — Randomize sample order each epoch; essential for training, unnecessary for validation.
Example: shuffle=True (training), shuffle=False (validation)

num_workers — Number of parallel data-loading processes; usually 4-8 for optimal performance.
Example: num_workers=4 (match available CPU cores)

pin_memory — Page-locks host memory for faster CPU→GPU transfer; use True when training on GPU.
Example: pin_memory=True (GPU training)

🔧 Image Preprocessing Pipeline

Step-by-step preprocessing transforms for CNN input preparation.

1. Resize — transforms.Resize(256)
   Resize the shorter side to 256px (maintains aspect ratio)

2. Center Crop — transforms.CenterCrop(224)
   Crop a 224×224 region from the center

3. To Tensor — transforms.ToTensor()
   Convert PIL image to tensor, scale [0,255] → [0,1]

4. Normalize — transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
   Normalize with ImageNet statistics

Complete Preprocessing Pipeline:
from torchvision import transforms

# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

# Validation transforms (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

📖 Additional Topics

🎯 Data Augmentation

Techniques to increase dataset diversity and prevent overfitting.

# Common augmentation techniques
transforms.RandomHorizontalFlip(p=0.5)
transforms.RandomRotation(degrees=15)
transforms.ColorJitter(brightness=0.2, contrast=0.2)
transforms.RandomResizedCrop(224, scale=(0.8, 1.0))
⚡ Memory Optimization

Best practices for efficient memory usage during training.

  • Use appropriate batch sizes (32-64)
  • Enable pin_memory for GPU training
  • Use DataLoader with num_workers
  • Monitor GPU memory usage

🔧 Preprocessing Pipeline

Essential image preprocessing steps for CNN input preparation: resizing, data augmentation, and normalization.

📏 Image Resizing & Standardization

Convert images of different sizes to uniform dimensions for batch processing.

🎯 Objective

Ensure all images have the same dimensions to create consistent tensors with shape (N, C, H, W) for efficient batch processing.

1. Resize to Standard Size — transforms.Resize(256)
   Scale the shorter side to 256px while preserving aspect ratio

2. Center Crop to Target Size — transforms.CenterCrop(224)
   Extract the exact 224×224 region from the center of the resized image

3. Convert to Tensor — transforms.ToTensor()
   Convert PIL image to a PyTorch tensor with values scaled from [0,255] to [0,1]

📊 Result:
# Input: Various sized images
# Output: Consistent tensor shape
tensor.shape = (batch_size, 3, 224, 224)
# Example: (32, 3, 224, 224) for batch of 32 images

🎨 Data Augmentation

Techniques to artificially increase dataset diversity and improve model generalization.

🎯 Objective
  • Increase Dataset Size: Generate more training samples from existing data
  • Improve Generalization: Help model learn robust features that work across variations
  • Reduce Overfitting: Prevent model from memorizing specific image characteristics
  • Simulate Real-world Variations: Account for lighting, rotation, scale differences
🔧 Implementation Methods
Geometric Transforms
# Random crop with scale variation
transforms.RandomResizedCrop(224, scale=(0.8, 1.0))

# Random horizontal flip
transforms.RandomHorizontalFlip(p=0.5)

# Random rotation
transforms.RandomRotation(degrees=15)
Color Transforms
# Color jitter for brightness/contrast
transforms.ColorJitter(
    brightness=0.2, 
    contrast=0.2, 
    saturation=0.2, 
    hue=0.1
)

# Random grayscale conversion
transforms.RandomGrayscale(p=0.1)
⚠️ Important Notes:
  • Training Only: Apply augmentation ONLY to training set
  • No Validation/Test: Keep validation and test sets unchanged
  • Consistent Evaluation: Ensure fair comparison across models
  • Realistic Augmentation: Use transformations that make sense for your domain
Complete Augmentation Pipeline:
# Training transforms (WITH augmentation)
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

# Validation transforms (NO augmentation)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

📊 Data Normalization

Standardize pixel values to improve training stability and convergence.

🎯 Objective
  • Stabilize Training: Prevent gradient explosion/vanishing
  • Faster Convergence: Help optimizer find optimal solution
  • Feature Standardization: Ensure all features have similar scales
  • Transfer Learning Compatibility: Match pre-trained model expectations
🔧 Implementation
Normalization Formula:
normalized_pixel = (pixel - mean) / std

Where pixel values are in range [0, 1] after ToTensor()
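Applied per channel, the formula works out as follows; a small plain-Python sketch (values assumed to already be in [0, 1] after ToTensor):

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(value, mean, std):
    """Apply (pixel - mean) / std for one channel."""
    return (value - mean) / std

# A mid-gray pixel (0.5 in every channel) after ImageNet normalization
print([round(normalize_pixel(0.5, m, s), 3)
       for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)])  # [0.066, 0.196, 0.418]
```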

📋 Model-Specific Statistics:
ImageNet Standard (ResNet, VGG, DenseNet)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

Most common for ImageNet pre-trained models

EfficientNet (timm)
# EfficientNet-B0 to B7
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# EfficientNetV2 (different stats)
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

EfficientNetV2 uses different normalization

Vision Transformer (ViT)
# ViT models in timm
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

ViT models use different normalization

ConvNeXt (timm)
# ConvNeXt models
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

ConvNeXt uses ImageNet standard

MobileNetV3 (timm)
# MobileNetV3 models
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

MobileNetV3 uses ImageNet standard

Custom Dataset
# Calculate from your training set
mean = [0.5, 0.5, 0.5]  # Example
std = [0.5, 0.5, 0.5]   # Example

Calculate from your specific dataset

🔧 Using timm Library:
import timm

# Get model with correct transforms
model = timm.create_model('efficientnet_b0', pretrained=True)

# Build transforms with explicit statistics
transforms = timm.data.create_transform(
    input_size=224,
    is_training=True,
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

# Or resolve the model's default preprocessing config
config = timm.data.resolve_data_config({}, model=model)
transforms = timm.data.create_transform(**config)

Note: timm applies model-specific normalization automatically when the transform is built from the config resolved via timm.data.resolve_data_config()

📊 How to Calculate Mean & Std:
1. From Training Set Only
# Calculate statistics from the training set
def calculate_dataset_stats(dataset):
    loader = DataLoader(dataset, batch_size=1, shuffle=False)
    mean = torch.zeros(3)
    std = torch.zeros(3)
    
    for images, _ in loader:               # images: (1, 3, H, W)
        mean += images.mean(dim=(0, 2, 3))
        # Averaging per-image stds is a common approximation
        # of the true dataset std
        std += images.std(dim=(0, 2, 3))
    
    mean /= len(loader)
    std /= len(loader)
    
    return mean, std
2. Not from the Entire Dataset
  • Training Set Only: Compute from training images to avoid leaking validation/test statistics
  • Sample Size: Use a subset if the dataset is too large
  • Consistency: Apply the same stats to all splits
🔧 Complete Normalization Example:
# ImageNet normalization (most common)
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406], 
    std=[0.229, 0.224, 0.225]
)

# Apply to tensor
normalized_tensor = normalize(tensor)

# Result: pixel values roughly in range [-2, 2]
# Mean ≈ 0, Std ≈ 1 for each channel

💡 Best Practices

🎯 Training vs Validation
  • Use augmentation for training only
  • Keep validation set unchanged
  • Apply same normalization to both
⚡ Performance
  • Use appropriate batch sizes
  • Enable pin_memory for GPU
  • Use num_workers for parallel loading
🔧 Implementation
  • Use transforms.Compose() for chaining
  • Test transforms on sample images
  • Monitor memory usage during training

🧠 CNN Backbone Architecture

Feature extraction backbones and transfer learning strategies for CNN classification.

🔍 Feature Extraction and Transformation

CNN backbones transform input images into high-level features suitable for classification.

📥 Input Processing
Input Tensor
# Input: Batch of normalized images
input_tensor.shape = (N, C, H, W)
# Example: (32, 3, 224, 224)
# N = batch_size, C = 3 (RGB), H = W = 224
CNN Backbone Processing
# Convolutional layers + Non-linear activations
# Linear transformations (convolutions) + Non-linear functions (ReLU, etc.)
# Extract hierarchical features from images
Output Features
# Output: Feature maps for each image
output_tensor.shape = (N, C_o, H_o, W_o)
# Example: (32, 2048, 7, 7) for ResNet50
# Each image has C_o × H_o × W_o features
# 2048 × 7 × 7 = 100,352 features per image
🔧 Convolutional Layer Details
ResNet50 Example
# First convolutional layer
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
# Parameters: 3 × 64 × 7 × 7 = 9,408 parameters

# Bottleneck block
conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
# Parameters: 64 × 64 × 3 × 3 = 36,864 parameters

# Total ResNet50 parameters: ~25.6M
# Most parameters in convolutional layers
EfficientNet-B0 Example
# Depthwise separable convolution
depthwise = nn.Conv2d(32, 32, kernel_size=3, groups=32)
# Parameters: 32 × 3 × 3 = 288 parameters

pointwise = nn.Conv2d(32, 64, kernel_size=1)
# Parameters: 32 × 64 × 1 × 1 = 2,048 parameters

# Total EfficientNet-B0 parameters: ~5.3M
# More efficient than standard convolutions
📊 Key Points:
  • Learnable Parameters: Mostly in convolutional layers
  • Feature Maps: Each layer extracts different level features
  • Hierarchical Features: Low-level → High-level features
  • Translation Equivariance: Convolutions are translation equivariant (pooling adds a degree of invariance)
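The parameter counts quoted above follow from the convolution hyperparameters. A small sketch (`conv2d_params` is an illustrative helper; bias omitted, as in most backbone convolutions):

```python
def conv2d_params(in_ch, out_ch, kernel, groups=1, bias=False):
    """Weight count of a 2D convolution: (in_ch/groups) * out_ch * k * k (+ bias)."""
    weights = (in_ch // groups) * out_ch * kernel * kernel
    return weights + (out_ch if bias else 0)

print(conv2d_params(3, 64, 7))              # 9408  (ResNet50 conv1)
print(conv2d_params(64, 64, 3))             # 36864 (ResNet50 3x3 bottleneck conv)
print(conv2d_params(32, 32, 3, groups=32))  # 288   (depthwise conv)
print(conv2d_params(32, 64, 1))             # 2048  (pointwise conv)
```

Note how grouping (depthwise convolution) divides the weight count by the number of groups, which is where EfficientNet's efficiency comes from.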

🏆 Popular Classification Models

Well-known CNN architectures for image classification tasks.

ResNet Family
  • ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
  • Residual connections for deep networks
  • Parameters: 11.7M (ResNet18) to 60.2M (ResNet152)
EfficientNet Family
  • EfficientNet-B0 to B7, EfficientNetV2
  • Compound scaling for efficiency
  • Parameters: 5.3M (B0) to 66M (B7)
MobileNet Family
  • MobileNetV2, MobileNetV3, EfficientNet-Lite
  • Mobile-optimized architectures
  • Parameters: 3.4M (MobileNetV2) to 5.5M (MobileNetV3)
Vision Transformers
  • ViT-Base, ViT-Large, Swin Transformer
  • Transformer-based architectures
  • Parameters: 86M (ViT-Base) to 307M (ViT-Large)

🔄 Transfer Learning

Leveraging pre-trained models for specific classification tasks.

🎯 Design Principles
a) Reuse Pre-trained Features

Leverage feature extractors pre-trained on large datasets (ImageNet)

# Load pre-trained backbone
backbone = timm.create_model('resnet50', pretrained=True)
# Features learned from 1.2M ImageNet images
b) Replace Classifier Head

Adapt output layer to match target number of classes

# Original: 1000 ImageNet classes
# Target: 37 Oxford Pets classes
classifier = nn.Linear(2048, 37)  # ResNet50 features
✅ Advantages of Transfer Learning:
  • Faster Training: Pre-trained features accelerate convergence
  • Better Performance: Leverage knowledge from large datasets
  • Less Data Required: Work well with smaller datasets
  • Computational Efficiency: Avoid training from scratch
  • Proven Features: Use battle-tested feature extractors
🔧 Training Strategy Selection:
c.1 Freeze Backbone
# Freeze CNN backbone
for param in backbone.parameters():
    param.requires_grad = False

# Only train classifier head
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

Use case: Small datasets, quick prototyping

c.2 Partial Freezing
# Freeze early layers, unfreeze later layers
for param in backbone.layer1.parameters():
    param.requires_grad = False
for param in backbone.layer4.parameters():
    param.requires_grad = True

Use case: Balanced approach

c.3 Full Fine-tuning
# Train entire model with pre-trained weights
for param in backbone.parameters():
    param.requires_grad = True

# Use lower learning rate for backbone
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])

Use case: Large datasets, best performance

c.4 Train from Scratch
# No pre-trained weights (rare)
model = timm.create_model('resnet50', pretrained=False)
# Train everything from scratch

Use case: Very large datasets, research

💻 Transfer Learning Implementation

Practical code examples for creating transfer learning models.

🔧 Strategy c.1: Freeze Backbone (TorchVision)
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet50
backbone = models.resnet50(pretrained=True)

# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace classifier head
num_classes = 37  # Oxford Pets classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only train classifier
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = backbone(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
🔧 Strategy c.1: Freeze Backbone (timm)
import timm
import torch.nn as nn

# Load pre-trained EfficientNet
backbone = timm.create_model('efficientnet_b0', pretrained=True)

# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace classifier head
num_classes = 37
backbone.classifier = nn.Linear(backbone.classifier.in_features, num_classes)

# Only train classifier
optimizer = torch.optim.Adam(backbone.classifier.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = backbone(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
📊 TorchVision vs timm:
TorchVision
  • Official PyTorch models
  • Standard architectures
  • Good documentation
  • Limited model variety
timm
  • Extensive model zoo
  • Latest architectures
  • Flexible configuration
  • Research-focused

🎯 Classifier Head

Adapting pre-trained models for specific classification tasks.

🎯 Classifier Head Overview

The final component that converts CNN features into class predictions.

📋 Definition & Purpose
What is a Classifier Head?

The classifier head is the final component of a CNN that transforms extracted features into class predictions. It takes high-level features from the CNN backbone and outputs probability scores for each class.

Input: Features from Backbone
# Features from CNN backbone
features.shape = (N, C, H, W)
# Example: (32, 2048, 7, 7) from ResNet50
Processing: Classifier Head
# Global Average Pooling + Linear Layer
# (N, C, H, W) → (N, C) → (N, num_classes)
Output: Class Predictions
# Class probabilities
predictions.shape = (N, num_classes)
# Example: (32, 37) for Oxford Pets

🔧 Architecture Components

Key components that make up a classifier head.

🌊 Global Average Pooling (GAP)

Purpose: Reduces spatial dimensions from (N, C, H, W) to (N, C, 1, 1)

Advantages: Reduces parameters, prevents overfitting, translation invariant

# Global Average Pooling
gap = nn.AdaptiveAvgPool2d(1)
# Input: (32, 2048, 7, 7)
# Output: (32, 2048, 1, 1)
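What GAP computes is simply the mean over each channel's H×W grid. A dependency-free sketch of the same operation on nested lists:

```python
def global_average_pool(feature_map):
    """feature_map: list of channels, each a 2D grid -> one scalar per channel."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]

fm = [[[1.0, 3.0],
       [5.0, 7.0]],   # channel 0: mean = 4.0
      [[0.0, 0.0],
       [0.0, 8.0]]]   # channel 1: mean = 2.0
print(global_average_pool(fm))  # [4.0, 2.0]
```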
🔗 Fully Connected Layers

Purpose: Maps features to class scores

Input: Number of features from backbone

Output: Number of classes in dataset

# Linear layer for classification
classifier = nn.Linear(2048, 37)
# Input: 2048 features (ResNet50)
# Output: 37 classes (Oxford Pets)
⚡ Activation Functions

Softmax: Converts scores to probabilities

Sigmoid: For multi-label classification

ReLU: For hidden layers

import torch.nn.functional as F

# Softmax for single-label classification
probabilities = F.softmax(logits, dim=1)
# Sigmoid for multi-label classification
probabilities = torch.sigmoid(logits)
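Softmax itself is straightforward to write out; a plain-Python sketch with the usual max-subtraction for numerical stability:

```python
import math

def softmax(logits):
    """Convert raw class scores to probabilities that sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))     # 1.0
print(probs.index(max(probs)))  # 0 -> highest logit gets highest probability
```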

🎨 Design Strategies

Different approaches to designing classifier heads.

🔧 Simple Classifier

Architecture: GAP → Linear layer

Use case: Small datasets, quick prototyping

Parameters: Minimal parameters

# Simple classifier head
class SimpleClassifier(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(num_features, num_classes)
    
    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage
classifier = SimpleClassifier(2048, 37)
🏗️ Multi-layer Classifier

Architecture: GAP → Linear → ReLU → Dropout → Linear

Use case: Complex datasets, better performance

Parameters: More parameters, more capacity

# Multi-layer classifier head
class MultiLayerClassifier(nn.Module):
    def __init__(self, num_features, num_classes, hidden_dim=512):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(num_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes)
        )
    
    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage
classifier = MultiLayerClassifier(2048, 37, hidden_dim=512)
🎯 Custom Classifier

Architecture: Custom design for specific tasks

Use case: Specialized requirements

Parameters: Flexible design

# Custom classifier head
class CustomClassifier(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.feature_extractor = nn.Sequential(
            nn.Linear(num_features, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.3)
        )
        self.classifier = nn.Linear(512, num_classes)
    
    def forward(self, x):
        x = self.gap(x).flatten(1)
        features = self.feature_extractor(x)
        return self.classifier(features)

# Usage
classifier = CustomClassifier(2048, 37)

💻 Implementation Examples

Practical code examples for creating classifier heads.

🔧 TorchVision Models
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet50
backbone = models.resnet50(pretrained=True)

# Replace classifier head
num_classes = 37  # Oxford Pets classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Check the new classifier
print(f"Classifier input features: {backbone.fc.in_features}")
print(f"Classifier output classes: {backbone.fc.out_features}")

# Usage
model = backbone
outputs = model(images)  # Shape: (batch_size, 37)
🔧 timm Models
import timm
import torch.nn as nn

# Create model with custom number of classes
model = timm.create_model('efficientnet_b0', 
                         pretrained=True, 
                         num_classes=37)

# Check model structure
print(f"Model: {model}")
print(f"Classifier: {model.classifier}")

# Usage
outputs = model(images)  # Shape: (batch_size, 37)
🔧 Custom Classifier
import torch
import torch.nn as nn
import torchvision.models as models

# Create custom classifier head
class CustomClassifierHead(nn.Module):
    def __init__(self, backbone_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(backbone_features, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )
    
    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage with ResNet50 backbone
backbone = models.resnet50(pretrained=True)
# Drop the original avgpool + fc so the backbone outputs
# spatial feature maps of shape (N, 2048, 7, 7) for the head's GAP
backbone = nn.Sequential(*list(backbone.children())[:-2])

# Add custom classifier
classifier = CustomClassifierHead(2048, 37)

# Complete model
class CompleteModel(nn.Module):
    def __init__(self, backbone, classifier):
        super().__init__()
        self.backbone = backbone
        self.classifier = classifier
    
    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features)

model = CompleteModel(backbone, classifier)

🔧 Training Strategies

Different approaches to training classifier heads.

🔒 Freeze Backbone

Strategy: Only train classifier head

Use case: Small datasets, quick training

Advantages: Fast training, less overfitting

# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Only train classifier
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
🔄 Fine-tune Backbone

Strategy: Train entire model

Use case: Large datasets, best performance

Advantages: Better performance, full model adaptation

# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True

# Use different learning rates
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
📈 Progressive Unfreezing

Strategy: Unfreeze layers gradually

Use case: Balanced approach

Advantages: Stable training, good performance

# Start with frozen backbone
for param in backbone.parameters():
    param.requires_grad = False

# Train classifier first
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)
# ... train for several epochs ...

# Unfreeze later layers
for param in backbone.layer4.parameters():
    param.requires_grad = True

# Continue training
optimizer = torch.optim.Adam([
    {'params': backbone.layer4.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])

💡 Best Practices

Guidelines for designing and training classifier heads.

🎯 Design Principles
  • Start Simple: Begin with basic classifier
  • Add Complexity: Gradually increase capacity
  • Monitor Overfitting: Watch for overfitting signs
  • Regularization: Use dropout, weight decay
⚡ Implementation Tips
  • Input Size: Match backbone output
  • Output Size: Match number of classes
  • Activation: Use appropriate functions
  • Initialization: Proper weight initialization
🔍 Debugging
  • Check Shapes: Verify input/output shapes
  • Monitor Gradients: Check gradient flow
  • Visualize: Plot training curves
  • Test: Validate on validation set

✨ Classification Results

Key findings and insights from the CNN classification experiments.

📊 Quick Summary

High-level overview of model performance and key findings.

🏆 Best Performing Model
VGG16 (NoAug)
89.1% Accuracy
88.5% F1-Score

VGG16 without data augmentation achieved the highest performance on Oxford Pets dataset.

📈 Performance Range
Accuracy: 77.1% - 89.1%
F1-Score: 76.7% - 88.5%
Top-5 Accuracy: 95.7% - 99.3%
⚡ Fastest Model
MobileNetV3 Large 100
Mobile Optimized
81.1% Accuracy

MobileNetV3 provides good balance between accuracy and speed for mobile deployment.

💡 Key Insights

Main findings and lessons learned from the experiments.

✅ What Worked
  • Transfer Learning: Pre-trained models significantly outperformed training from scratch
  • Data Augmentation: Mixed results - helped some models, hurt others
  • Architecture Choice: VGG16 and ResNet50 performed consistently well
  • Feature Extraction: CNN backbones learned effective features for pet classification
❌ What Didn't Work
  • EfficientNet B3: Underperformed despite modern architecture
  • Aggressive Augmentation: Some models overfitted with heavy augmentation
  • Small Models: MobileNetV3 struggled with complex pet features
  • Training from Scratch: Much slower convergence than transfer learning
🎯 Key Takeaways
  • Transfer Learning is Essential: Pre-trained features are crucial for good performance
  • Architecture Matters: VGG16's simplicity worked better than complex modern architectures
  • Data Augmentation is Model-Specific: Not all models benefit equally
  • Mobile Deployment: MobileNetV3 provides good speed/accuracy trade-off

🔍 Detailed Analysis

For comprehensive performance analysis, explore the sections below.

📊 Model Comparison

Detailed performance metrics, confusion matrices, and model rankings.

🎨 Feature Maps Explorer

Visualize CNN layer outputs and understand what models learn.

⚙️ Configuration

Model architectures, training parameters, and data configurations.

🚀 Recommendations

Practical guidance for model selection and deployment.

🎯 For Best Accuracy
VGG16 (NoAug)
Highest accuracy (89.1%) with stable performance
Use for: Research, high-accuracy requirements
⚡ For Mobile Deployment
MobileNetV3 Large 100
Good accuracy (81.1%) with mobile optimization
Use for: Mobile apps, edge devices
🔄 For Balanced Performance
ResNet50 (NoAug)
Good accuracy (86.9%) with proven architecture
Use for: General purpose, production systems

📊 Model Comparison

Compare performance metrics across CNN models

The comparison table reports, per model: Precision ↑, Recall ↑, F1-Score ↑, Accuracy ↑, and Top-5 Accuracy ↑ (higher is better), alongside Training Time ↓ (min), Inference Time ↓ (ms), FLOPs ↓ (B), Parameters ↓ (M), and Model Size ↓ (MB) (lower is better).
📝 Note: Training time is the total wall-clock training time; inference time is the average per-sample latency.

📊 Classification Metrics Formulas

🎯 Basic Classification Metrics

Accuracy
\[\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}\]

Overall correctness of the model across all classes

Precision
\[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]

Proportion of positive predictions that are actually correct

Recall
\[\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]

Proportion of actual positives that are correctly identified

F1-Score
\[\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}}\]

Harmonic mean of precision and recall

Top-5 Accuracy
\[\text{Top-5 Accuracy} = \frac{\text{Number of correct predictions in top-5}}{\text{Total predictions}}\]

Accuracy when considering top-5 predictions
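These definitions translate directly into code. A sketch using plain counts and score lists (helper names are illustrative):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

def top_k_accuracy(scores, labels, k=5):
    """scores: one list of class scores per sample; labels: true class indices."""
    hits = 0
    for s, y in zip(scores, labels):
        top_k = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
        hits += y in top_k
    return hits / len(labels)

print(f1_score(tp=8, fp=2, fn=2))  # 0.8
```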

📈 Averaging Methods

Macro Average
\[\text{Macro-Precision} = \frac{1}{n} \sum_{i=1}^{n} \text{Precision}_i\] \[\text{Macro-Recall} = \frac{1}{n} \sum_{i=1}^{n} \text{Recall}_i\] \[\text{Macro-F1} = \frac{1}{n} \sum_{i=1}^{n} \text{F1}_i\]

Simple average across all classes (treats all classes equally)

Weighted Average
\[\text{Weighted-Precision} = \frac{\sum_{i=1}^{n} \text{Precision}_i \times \text{Support}_i}{\sum_{i=1}^{n} \text{Support}_i}\] \[\text{Weighted-Recall} = \frac{\sum_{i=1}^{n} \text{Recall}_i \times \text{Support}_i}{\sum_{i=1}^{n} \text{Support}_i}\] \[\text{Weighted-F1} = \frac{\sum_{i=1}^{n} \text{F1}_i \times \text{Support}_i}{\sum_{i=1}^{n} \text{Support}_i}\]

Average weighted by the number of true instances for each class
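The practical difference between the two averaging methods shows up under class imbalance. A small sketch with made-up per-class F1 scores:

```python
def macro_average(per_class):
    """Unweighted mean across classes: treats every class equally."""
    return sum(per_class) / len(per_class)

def weighted_average(per_class, support):
    """Mean weighted by each class's number of true instances."""
    return sum(v * s for v, s in zip(per_class, support)) / sum(support)

f1_per_class = [0.9, 0.5]  # majority class scores well, minority class poorly
support = [90, 10]

print(round(macro_average(f1_per_class), 2))             # 0.7  -> penalized by minority class
print(round(weighted_average(f1_per_class, support), 2)) # 0.86 -> dominated by majority class
```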

⚡ Resource Metrics

FLOPs (Floating Point Operations)
\[\text{FLOPs} = \sum_{l=1}^{L} \text{FLOPs}_l\] \[\text{FLOPs}_l = 2 \times \text{MACs}_l\]

Total floating point operations for forward pass
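Per-layer FLOPs follow from the MAC count of a convolution: every output position of every output channel performs (in_ch/groups)·k·k multiply-accumulates. A sketch (`conv_flops` is an illustrative helper; bias and activation ignored):

```python
def conv_flops(in_ch, out_ch, kernel, out_h, out_w, groups=1):
    """FLOPs = 2 x MACs for one convolutional layer."""
    macs = (in_ch // groups) * kernel * kernel * out_ch * out_h * out_w
    return 2 * macs

# ResNet50 conv1: 3 -> 64 channels, 7x7 kernel, 112x112 output
print(conv_flops(3, 64, 7, 112, 112))  # 236027904 (~0.24 GFLOPs)
```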

📊 Model Visualizations

Interactive feature maps and model analysis visualizations.

🔍 Feature Maps Explorer

Explore how different CNN layers extract features from images.



📈 Learning Curves

Training and validation performance over epochs.

🏗️ Model Architecture

Visual representation of CNN architectures and layer connections.

⚙️ Model Configurations

Training and evaluation configurations for all models in the comparison table.

🎯 Training Configuration

Transfer Learning Strategy
Mode: Transfer Learning (Freeze CNN Backbone)
Frozen Layers: All CNN backbone layers
Trainable Layers: Classifier head only (final FC layer)
Pre-trained Weights: ImageNet-1K
Hyperparameters
Optimizer: Adam
Learning Rate: 0.001 (1e-3)
Batch Size: 32
Max Epochs: 40
Early Stopping: Patience = 5 epochs, Min Delta = 0.001
Loss Function: Cross Entropy Loss
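The early-stopping rule above (patience = 5, min delta = 0.001) can be sketched as a small tracker. This is an illustrative implementation, not the exact code used in the experiments:

```python
class EarlyStopping:
    """Stop when validation loss fails to improve by min_delta for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
print(stopper.step(1.00))  # False -> first recorded loss becomes the best
print(stopper.step(1.00))  # False -> 1 epoch without improvement
print(stopper.step(1.00))  # True  -> patience exhausted
```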

📊 Data Configuration

Dataset Split
Training Set: 3,680 images
Validation Set: 3,669 images
Test Set: 3,669 images
Number of Classes: 37 breeds (12 cats + 25 dogs)
Image Preprocessing
Input Size: 224×224×3 (RGB)
Normalization (VGG16, ResNet50): Mean=[0.485, 0.456, 0.406], Std=[0.229, 0.224, 0.225]
Normalization (EfficientNet, MobileNetV3, Xception): Mean=[0.5, 0.5, 0.5], Std=[0.5, 0.5, 0.5]
Data Augmentation
No Augmentation: Resize(256) → CenterCrop(224) → ToTensor → Normalize
With Augmentation: RandomResizedCrop(224) → RandomHorizontalFlip(p=0.5) → ColorJitter → ToTensor → Normalize

🏗️ Model Architectures

VGG16
Type: Dense CNN
Depth: 16 layers (13 conv + 3 FC)
Parameters: 134.41M
Best for: Server deployment, high accuracy
ResNet50
Type: Residual Network
Depth: 50 layers with skip connections
Parameters: 23.58M
Best for: Balanced performance and efficiency
MobileNetV3 Large
Type: Depthwise Separable CNN
Depth: Inverted residual blocks with SE modules
Parameters: 4.25M
Best for: Mobile and edge devices
EfficientNet-B3
Type: Compound Scaled CNN
Depth: MBConv blocks with compound scaling
Parameters: 10.75M
Best for: Balanced accuracy and efficiency
Xception
Type: Extreme Inception
Depth: Depthwise separable convolutions
Parameters: 20.88M
Best for: High accuracy with moderate size

🧪 Evaluation Configuration

Metrics & Hardware
Metrics: Accuracy, Precision, Recall, F1-Score, Top-5 Accuracy
Averaging Method: Macro average
Training Hardware: NVIDIA GPU (CUDA enabled)
Inference Benchmark: Single image, GPU inference, average over 100 runs

📊 Confusion Matrix

Select a model row in the comparison table above to view its confusion matrix.