Computer Vision classification pipeline with CNN architectures
Technical aspects of preparing input data for CNN models: tensor formats, data loading, and preprocessing.
CNN models require 4D tensors with specific dimension ordering for efficient computation.
input_shape = (N, 3, 224, 224)
memory_per_image ≈ 0.57MB  # 3 × 224 × 224 × 4 bytes (float32)
input_shape = (N, 3, 300, 300)
memory_per_image ≈ 1.03MB  # 3 × 300 × 300 × 4 bytes (float32)
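The per-image figures follow directly from the tensor shape and the element size. A minimal sketch (hypothetical helper name), assuming float32 at 4 bytes per value:

```python
def image_memory_bytes(channels, height, width, bytes_per_value=4):
    """Memory for one image tensor (default: float32 = 4 bytes/value)."""
    return channels * height * width * bytes_per_value

mb_224 = image_memory_bytes(3, 224, 224) / 1024**2  # ≈ 0.57 MB
mb_300 = image_memory_bytes(3, 300, 300) / 1024**2  # ≈ 1.03 MB
```

Multiply by the batch size N (and again for activations and gradients during training) to estimate total input memory.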
Understanding the difference between channel-first and channel-last memory layouts.
# PyTorch tensor format
tensor.shape = (N, C, H, W)
# Example: (32, 3, 224, 224)
# Memory layout: [batch, channels, height, width]
# More efficient for GPU operations
# TensorFlow tensor format
tensor.shape = (N, H, W, C)
# Example: (32, 224, 224, 3)
# Memory layout: [batch, height, width, channels]
# More intuitive for image processing
# NCHW → NHWC (e.g., preparing a PyTorch tensor for TensorFlow-style code)
tensor_nhwc = torch.permute(tensor_nchw, (0, 2, 3, 1))
# NHWC → NCHW (e.g., after loading TensorFlow-style data into PyTorch)
tensor_nchw = torch.permute(tensor_nhwc, (0, 3, 1, 2))
# Note: both sides here are torch tensors; only the dimension order changes
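The index mapping behind these permutes can be shown without any framework. A sketch (hypothetical helper name) that reorders a nested-list "tensor" from (N, C, H, W) to (N, H, W, C):

```python
def nchw_to_nhwc(batch):
    """Reorder a nested-list 'tensor' from (N, C, H, W) to (N, H, W, C)."""
    return [
        [
            [
                [batch[n][c][h][w] for c in range(len(batch[n]))]
                for w in range(len(batch[n][0][0]))
            ]
            for h in range(len(batch[n][0]))
        ]
        for n in range(len(batch))
    ]

# 1 image, 2 channels, 2x2 pixels
x = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
y = nchw_to_nhwc(x)
# y[0][0][0] == [1, 5]: pixel (0, 0) now holds both channel values together
```

In channel-last layout the values for one pixel sit next to each other in memory, which is why it reads as "more intuitive for image processing".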
# Using torchvision.transforms
from torchvision.transforms import ToTensor, ToPILImage
tensor = ToTensor()(pil_image) # PIL → PyTorch tensor
Implementing efficient data loading with PyTorch's Dataset and DataLoader classes.
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
class OxfordPetsDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.transform = transform
        self.samples = self._load_samples()
        # Map breed names to integer labels (CrossEntropyLoss expects class indices)
        breeds = sorted({breed for _, breed in self.samples})
        self.class_to_idx = {breed: i for i, breed in enumerate(breeds)}

    def _load_samples(self):
        """Load image paths and breed names from a breed-per-folder layout"""
        samples = []
        for breed in os.listdir(self.image_dir):
            breed_dir = os.path.join(self.image_dir, breed)
            for img_name in os.listdir(breed_dir):
                img_path = os.path.join(breed_dir, img_name)
                samples.append((img_path, breed))
        return samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, breed = self.samples[idx]
        image = Image.open(image_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image, self.class_to_idx[breed]
# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,    # Number of samples per batch
    shuffle=True,     # Randomize order each epoch
    num_workers=4,    # Parallel loading processes
    pin_memory=True,  # Faster host→GPU transfer
    drop_last=True    # Drop the last incomplete batch
)

# Usage in training loop
for batch_idx, (images, labels) in enumerate(dataloader):
    # images: (batch_size, 3, 224, 224)
    # labels: (batch_size,)
    outputs = model(images)
    loss = criterion(outputs, labels)
Number of samples per batch. Affects memory usage and training stability.
batch_size=32 (common choice)
Randomize sample order each epoch. Essential for training, not for validation.
shuffle=True (training), shuffle=False (validation)
Parallel data loading processes. Usually 4-8 for optimal performance.
num_workers=4 (CPU cores)
Faster GPU transfer. Use True when training on GPU.
pin_memory=True (GPU training)
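The interaction between `batch_size` and `drop_last` determines how many batches one epoch yields. A small sketch (hypothetical helper name) of that arithmetic:

```python
def num_batches(dataset_size, batch_size, drop_last=True):
    """How many batches one epoch yields for a given DataLoader config."""
    full, remainder = divmod(dataset_size, batch_size)
    return full if (drop_last or remainder == 0) else full + 1

num_batches(1000, 32, drop_last=True)   # 31 (the last 8 samples are dropped)
num_batches(1000, 32, drop_last=False)  # 32 (final batch has only 8 samples)
```

`drop_last=True` trades a few samples per epoch for uniform batch shapes, which some layers (e.g., BatchNorm with very small final batches) and compiled kernels prefer.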
Step-by-step preprocessing transforms for CNN input preparation.
transforms.Resize(256)
Resize to 256px (maintains aspect ratio)
transforms.CenterCrop(224)
Crop to 224×224 from center
transforms.ToTensor()
Convert PIL to tensor, scale [0,255] → [0,1]
transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
ImageNet statistics normalization
from torchvision import transforms
# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Validation transforms (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
Techniques to increase dataset diversity and prevent overfitting.
# Common augmentation techniques
transforms.RandomHorizontalFlip(p=0.5)
transforms.RandomRotation(degrees=15)
transforms.ColorJitter(brightness=0.2, contrast=0.2)
transforms.RandomResizedCrop(224, scale=(0.8, 1.0))
Best practices for efficient memory usage during training.
Essential image preprocessing steps for CNN input preparation: resizing, data augmentation, and normalization.
Convert images of different sizes to uniform dimensions for batch processing.
Ensure all images have the same dimensions to create consistent tensors with shape (N, C, H, W) for efficient batch processing.
# Resize to 256px (maintains aspect ratio)
transforms.Resize(256)
Scale image to 256px while preserving aspect ratio
# Crop to exact 224x224 from center
transforms.CenterCrop(224)
Extract 224×224 region from center of resized image
# Convert PIL to tensor, scale [0,255] → [0,1]
transforms.ToTensor()
Convert PIL image to PyTorch tensor with normalized values
# Input: Various sized images
# Output: Consistent tensor shape
tensor.shape = (batch_size, 3, 224, 224)
# Example: (32, 3, 224, 224) for batch of 32 images
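The Resize/CenterCrop pair above can be traced with plain arithmetic. A sketch (hypothetical helper names) for a hypothetical 500×375 input:

```python
def resize_shorter_side(width, height, target=256):
    """New (width, height) after scaling the shorter side to `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

w, h = resize_shorter_side(500, 375)  # (341, 256): aspect ratio preserved
box = center_crop_box(w, h)           # (58, 16, 282, 240): 224x224 window
```

Resizing the shorter side first guarantees the 224×224 crop always fits, whatever the original aspect ratio.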
Techniques to artificially increase dataset diversity and improve model generalization.
# Random crop with scale variation
transforms.RandomResizedCrop(224, scale=(0.8, 1.0))
# Random horizontal flip
transforms.RandomHorizontalFlip(p=0.5)
# Random rotation
transforms.RandomRotation(degrees=15)
# Color jitter for brightness/contrast
transforms.ColorJitter(
    brightness=0.2,
    contrast=0.2,
    saturation=0.2,
    hue=0.1
)
# Random grayscale conversion
transforms.RandomGrayscale(p=0.1)
# Training transforms (WITH augmentation)
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Validation transforms (NO augmentation)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
Standardize pixel values to improve training stability and convergence.
normalized_pixel = (pixel - mean) / std
Where pixel values are in range [0, 1] after ToTensor()
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
Most common for ImageNet pre-trained models
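Applying the formula with the ImageNet statistics shows the value range the network actually sees. A sketch (hypothetical helper name) for the red channel:

```python
MEAN = [0.485, 0.456, 0.406]  # ImageNet per-channel statistics
STD = [0.229, 0.224, 0.225]

def normalize_pixel(value, channel):
    """Apply (pixel - mean) / std to a value already scaled to [0, 1]."""
    return (value - MEAN[channel]) / STD[channel]

lo = normalize_pixel(0.0, 0)  # ≈ -2.12: darkest possible red value
hi = normalize_pixel(1.0, 0)  # ≈ 2.25: brightest possible red value
```

A pixel equal to the channel mean maps exactly to 0, which is what centers the input distribution for stable training.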
# EfficientNet-B0 to B7
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# EfficientNetV2 (different stats)
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]
EfficientNetV2 uses different normalization
# ViT models in timm
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]
ViT models use different normalization
# ConvNeXt models
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
ConvNeXt uses ImageNet standard
# MobileNetV3 models
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
MobileNetV3 uses ImageNet standard
# Calculate from your training set
mean = [0.5, 0.5, 0.5] # Example
std = [0.5, 0.5, 0.5] # Example
Calculate from your specific dataset
import timm
# Get model with correct transforms
model = timm.create_model('efficientnet_b0', pretrained=True)
# Get model-specific transforms
transforms = timm.data.create_transform(
    input_size=224,
    is_training=True,
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
# Or derive the model's default preprocessing from its pretrained config
config = timm.data.resolve_data_config({}, model=model)
transforms = timm.data.create_transform(**config, is_training=True)
Note: timm resolves model-specific input size and normalization via timm.data.resolve_data_config(), so the statistics don't need to be hard-coded per model.
# Calculate statistics from the training set
def calculate_dataset_stats(dataset):
    loader = DataLoader(dataset, batch_size=1, shuffle=False)
    mean = torch.zeros(3)
    std = torch.zeros(3)
    for images, _ in loader:  # images: (1, 3, H, W)
        mean += images.mean(dim=(0, 2, 3))
        std += images.std(dim=(2, 3)).mean(dim=0)
    mean /= len(loader)
    std /= len(loader)
    # Note: averaging per-image std only approximates the true dataset std
    return mean, std
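Averaging per-image standard deviations is only an approximation; the exact dataset std treats every pixel as one population. A framework-free sketch (hypothetical helper name) of the exact two-pass computation for a single channel:

```python
def exact_channel_stats(pixels):
    """Exact mean/std over a flat list of pixel values from one channel."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n  # population variance
    return mean, var ** 0.5

mean, std = exact_channel_stats([0.0, 0.25, 0.5, 0.75, 1.0])
# mean = 0.5, std ≈ 0.354
```

For large datasets the same result can be obtained in one pass by accumulating sums and sums of squares per channel.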
# ImageNet normalization (most common)
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
# Apply to tensor
normalized_tensor = normalize(tensor)
# Result: pixel values roughly in range [-2.1, 2.6]
# Mean ≈ 0, Std ≈ 1 for each channel
Feature extraction backbones and transfer learning strategies for CNN classification.
CNN backbones transform input images into high-level features suitable for classification.
# Input: Batch of normalized images
input_tensor.shape = (N, C, H, W)
# Example: (32, 3, 224, 224)
# N = batch_size, C = 3 (RGB), H = W = 224
# Convolutional layers + Non-linear activations
# Linear transformations (convolutions) + Non-linear functions (ReLU, etc.)
# Extract hierarchical features from images
# Output: Feature maps for each image
output_tensor.shape = (N, C_o, H_o, W_o)
# Example: (32, 2048, 7, 7) for ResNet50
# Each image has C_o × H_o × W_o features
# 2048 × 7 × 7 = 100,352 features per image
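The 224 → 7 spatial reduction follows from the standard convolution output-size formula. A sketch (hypothetical helper name) tracing ResNet50's downsampling path, modeling each stage's stride-2 downsample branch as a 1×1 conv:

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# Trace ResNet50's downsampling for a 224x224 input
s = conv_out(224, kernel=7, stride=2, padding=3)  # stem conv  -> 112
s = conv_out(s, kernel=3, stride=2, padding=1)    # max pool   -> 56
for _ in range(3):                                # layer2-4 each halve
    s = conv_out(s, kernel=1, stride=2, padding=0)
# s is now 7, matching the (N, 2048, 7, 7) output above
```

The overall 32× reduction (224 / 32 = 7) is typical for classification backbones: aggressive downsampling keeps the feature maps small while the channel count grows.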
# First convolutional layer
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
# Parameters: 3 × 64 × 7 × 7 = 9,408 parameters
# Bottleneck block
conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
# Parameters: 64 × 64 × 3 × 3 = 36,864 parameters
# Total ResNet50 parameters: ~25.6M
# Most parameters in convolutional layers
# Depthwise separable convolution
depthwise = nn.Conv2d(32, 32, kernel_size=3, groups=32)
# Parameters: 32 × 3 × 3 = 288 parameters
pointwise = nn.Conv2d(32, 64, kernel_size=1)
# Parameters: 32 × 64 × 1 × 1 = 2,048 parameters
# Total EfficientNet-B0 parameters: ~5.3M
# More efficient than standard convolutions
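The parameter counts quoted above can be verified with a small helper (hypothetical name) for Conv2d weight counts:

```python
def conv_params(in_ch, out_ch, kernel, groups=1, bias=False):
    """Weight count of a Conv2d layer (bias optional)."""
    weights = (in_ch // groups) * out_ch * kernel * kernel
    return weights + (out_ch if bias else 0)

standard = conv_params(32, 64, 3)              # 18,432: one 3x3 conv, 32->64
depthwise = conv_params(32, 32, 3, groups=32)  # 288: one 3x3 filter per channel
pointwise = conv_params(32, 64, 1)             # 2,048: 1x1 channel mixing
# Separable total: 2,336 — roughly 8x fewer parameters than the standard conv
```

This factoring of spatial filtering (depthwise) from channel mixing (pointwise) is the core trick behind MobileNet- and EfficientNet-style architectures.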
Well-known CNN architectures for image classification tasks.
Leveraging pre-trained models for specific classification tasks.
Leverage feature extractors pre-trained on large datasets (ImageNet)
# Load pre-trained backbone
backbone = timm.create_model('resnet50', pretrained=True)
# Features learned from 1.2M ImageNet images
Adapt output layer to match target number of classes
# Original: 1000 ImageNet classes
# Target: 37 Oxford Pets classes
classifier = nn.Linear(2048, 37) # ResNet50 features
# Freeze CNN backbone
for param in backbone.parameters():
    param.requires_grad = False

# Only train classifier head
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)
Use case: Small datasets, quick prototyping
# Freeze early layers, unfreeze later layers
for param in backbone.layer1.parameters():
    param.requires_grad = False
for param in backbone.layer4.parameters():
    param.requires_grad = True
Use case: Balanced approach
# Train entire model with pre-trained weights
for param in backbone.parameters():
    param.requires_grad = True

# Use a lower learning rate for the backbone
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])
Use case: Large datasets, best performance
# No pre-trained weights (rare)
model = timm.create_model('resnet50', pretrained=False)
# Train everything from scratch
Use case: Very large datasets, research
Practical code examples for creating transfer learning models.
import torch
import torch.nn as nn
import torchvision.models as models
# Load pre-trained ResNet50 (the pretrained= flag is deprecated in torchvision)
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace classifier head (new parameters are trainable by default)
num_classes = 37  # Oxford Pets classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only train classifier
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = backbone(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
import timm
import torch.nn as nn

# Load pre-trained EfficientNet
backbone = timm.create_model('efficientnet_b0', pretrained=True)

# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace classifier head
num_classes = 37
backbone.classifier = nn.Linear(backbone.classifier.in_features, num_classes)

# Only train classifier
optimizer = torch.optim.Adam(backbone.classifier.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = backbone(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
Adapting pre-trained models for specific classification tasks.
The final component that converts CNN features into class predictions.
The classifier head is the final component of a CNN that transforms extracted features into class predictions. It takes high-level features from the CNN backbone and outputs probability scores for each class.
# Features from CNN backbone
features.shape = (N, C, H, W)
# Example: (32, 2048, 7, 7) from ResNet50
# Global Average Pooling + Linear Layer
# (N, C, H, W) → (N, C) → (N, num_classes)
# Class probabilities
predictions.shape = (N, num_classes)
# Example: (32, 37) for Oxford Pets
Key components that make up a classifier head.
Purpose: Reduces spatial dimensions from (N, C, H, W) to (N, C, 1, 1)
Advantages: Reduces parameters, prevents overfitting, translation invariant
# Global Average Pooling
gap = nn.AdaptiveAvgPool2d(1)
# Input: (32, 2048, 7, 7)
# Output: (32, 2048, 1, 1)
Purpose: Maps features to class scores
Input: Number of features from backbone
Output: Number of classes in dataset
# Linear layer for classification
classifier = nn.Linear(2048, 37)
# Input: 2048 features (ResNet50)
# Output: 37 classes (Oxford Pets)
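The linear head is tiny compared with the backbone, which is why training only the head is so cheap. A sketch (hypothetical helper name) of the parameter count:

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

linear_params(2048, 37)  # 75,813 — negligible next to ResNet50's ~25.6M
```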
Softmax: Converts scores to probabilities
Sigmoid: For multi-label classification
ReLU: For hidden layers
# Softmax for single-label classification
probabilities = F.softmax(logits, dim=1)
# Sigmoid for multi-label classification
probabilities = torch.sigmoid(logits)
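What softmax does to the raw logits is easy to show without a framework. A numerically stable pure-Python sketch (hypothetical function name):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)  # subtracting the max avoids exp() overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sum to 1.0, and the largest logit gets the highest probability
```

Note that during training, `nn.CrossEntropyLoss` applies log-softmax internally, so the model should output raw logits; explicit softmax is only needed at inference time.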
Different approaches to designing classifier heads.
Architecture: GAP → Linear layer
Use case: Small datasets, quick prototyping
Parameters: Minimal parameters
# Simple classifier head
class SimpleClassifier(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(num_features, num_classes)

    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage
classifier = SimpleClassifier(2048, 37)
Architecture: GAP → Linear → ReLU → Dropout → Linear
Use case: Complex datasets, better performance
Parameters: More parameters, more capacity
# Multi-layer classifier head
class MultiLayerClassifier(nn.Module):
    def __init__(self, num_features, num_classes, hidden_dim=512):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(num_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage
classifier = MultiLayerClassifier(2048, 37, hidden_dim=512)
Architecture: Custom design for specific tasks
Use case: Specialized requirements
Parameters: Flexible design
# Custom classifier head
class CustomClassifier(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.feature_extractor = nn.Sequential(
            nn.Linear(num_features, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.3)
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.gap(x).flatten(1)
        features = self.feature_extractor(x)
        return self.classifier(features)

# Usage
classifier = CustomClassifier(2048, 37)
Practical code examples for creating classifier heads.
import torch
import torch.nn as nn
import torchvision.models as models
# Load pre-trained ResNet50
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Replace classifier head
num_classes = 37 # Oxford Pets classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
# Check the new classifier
print(f"Classifier input features: {backbone.fc.in_features}")
print(f"Classifier output classes: {backbone.fc.out_features}")
# Usage
model = backbone
outputs = model(images) # Shape: (batch_size, 37)
import timm
import torch.nn as nn
# Create model with custom number of classes
model = timm.create_model('efficientnet_b0',
                          pretrained=True,
                          num_classes=37)
# Check model structure
print(f"Model: {model}")
print(f"Classifier: {model.classifier}")
# Usage
outputs = model(images) # Shape: (batch_size, 37)
import torch
import torch.nn as nn
# Create custom classifier head
class CustomClassifierHead(nn.Module):
    def __init__(self, backbone_features, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(backbone_features, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.gap(x).flatten(1)
        return self.classifier(x)

# Usage with a ResNet50 backbone
# Keep only the convolutional stages: torchvision's resnet50 applies its own
# avgpool + flatten, so we strip both to get (N, 2048, 7, 7) feature maps
# that the head's global average pooling expects
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2])

# Add custom classifier
classifier = CustomClassifierHead(2048, 37)

# Complete model
class CompleteModel(nn.Module):
    def __init__(self, backbone, classifier):
        super().__init__()
        self.backbone = backbone
        self.classifier = classifier

    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features)

model = CompleteModel(backbone, classifier)
Different approaches to training classifier heads.
Strategy: Only train classifier head
Use case: Small datasets, quick training
Advantages: Fast training, less overfitting
# Freeze backbone parameters
for param in backbone.parameters():
    param.requires_grad = False

# Only train classifier
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
Strategy: Train entire model
Use case: Large datasets, best performance
Advantages: Better performance, full model adaptation
# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True

# Use different learning rates
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])

# Training loop
for epoch in range(num_epochs):
    for images, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
Strategy: Unfreeze layers gradually
Use case: Balanced approach
Advantages: Stable training, good performance
# Start with a frozen backbone
for param in backbone.parameters():
    param.requires_grad = False

# Train classifier first
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)
# ... train for several epochs ...

# Unfreeze later layers
for param in backbone.layer4.parameters():
    param.requires_grad = True

# Continue training
optimizer = torch.optim.Adam([
    {'params': backbone.layer4.parameters(), 'lr': 0.0001},
    {'params': classifier.parameters(), 'lr': 0.001}
])
Guidelines for designing and training classifier heads.
Key findings and insights from the CNN classification experiments.
High-level overview of model performance and key findings.
VGG16 without data augmentation achieved the highest performance on Oxford Pets dataset.
MobileNetV3 provides good balance between accuracy and speed for mobile deployment.
Main findings and lessons learned from the experiments.
For comprehensive performance analysis, explore the sections below.
Detailed performance metrics, confusion matrices, and model rankings.
Visualize CNN layer outputs and understand what models learn.
Model architectures, training parameters, and data configurations.
Practical guidance for model selection and deployment.
Compare performance metrics across CNN models
| Model | Precision ↑ | Recall ↑ | F1-Score ↑ | Accuracy ↑ | Top-5 Acc ↑ |
|---|---|---|---|---|---|

| Model | Training Time ↓ (min) | Inference Time ↓ (ms) | FLOPs ↓ (B) | Parameters ↓ (M) | Model Size ↓ (MB) |
|---|---|---|---|---|---|
Overall correctness of the model across all classes
Proportion of positive predictions that are actually correct
Proportion of actual positives that are correctly identified
Harmonic mean of precision and recall
Accuracy when considering top-5 predictions
Simple average across all classes (treats all classes equally)
Average weighted by the number of true instances for each class
Total floating point operations for forward pass
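All of the quality metrics above can be derived from a confusion matrix. A pure-Python sketch (hypothetical helper name, toy 2-class matrix) of per-class precision/recall/F1 and their macro average:

```python
def per_class_metrics(confusion):
    """Precision/recall/F1 per class from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, wrong
        fn = sum(confusion[c]) - tp                       # true c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

# Toy 2-class example: 8 + 9 correct predictions out of 20
cm = [[8, 2],
      [1, 9]]
metrics = per_class_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)  # unweighted mean
```

The weighted average would instead weight each class's F1 by its row sum (number of true instances), which matters when classes are imbalanced.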
Interactive feature maps and model analysis visualizations.
Explore how different CNN layers extract features from images.
Select a model, layer, and sample to view feature maps.
Training and validation performance over epochs.
Visual representation of CNN architectures and layer connections.
Training and evaluation configurations for all models in the comparison table.
Select a model from the comparison table above to view its confusion matrix