Building ResNet from Scratch: The Architecture That Changed Deep Learning Forever

Introduction to ResNet: A Revolution in Neural Networks

In the mid-2010s, the field of deep learning was dramatically reshaped by the introduction of Residual Networks, or ResNet. Prior to its arrival, deeper neural network architectures often suffered from a phenomenon known as the “vanishing gradient problem,” where layers closer to the input received signals so weak they could barely learn anything during training. This challenge stifled progress and left researchers searching for effective ways to build much deeper models.

ResNet, proposed by Kaiming He and his colleagues at Microsoft Research in 2015, revolutionized how we design and train very deep neural networks. The key breakthrough was the introduction of residual connections, a structural innovation that elegantly sidestepped the vanishing gradient problem by allowing information to bypass certain layers. Instead of learning an entirely new representation, each block learns only the difference (or “residual”) between the output it should produce and the input it receives. This simple tweak enabled researchers to train neural networks with hundreds of layers, and even more than a thousand, a feat previously thought unattainable.

By allowing “shortcut” paths for the gradient to flow unimpeded, ResNet made it possible for neural networks to keep getting deeper without a loss in training performance. This approach quickly proved transformative: at the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an ensemble of ResNets won the classification task with a top-5 error rate of just 3.57%, outclassing shallower competitors.

The implications of this were immediate and far-reaching. ResNet’s architecture not only set new performance records but also provided a robust blueprint for building exceedingly deep networks. Models built on residual networks have since powered advances in everything from image recognition to natural language processing and generative models. The original ResNet paper has become one of the most-cited works in AI literature, influencing models deployed in facial recognition, self-driving cars, medical diagnosis, and beyond.

Today, whether you’re aiming for state-of-the-art results in computer vision or seeking a deeper understanding of why modern AI works so well, knowing the fundamentals of ResNet is essential. As you’ll see, building ResNet from scratch isn’t just a technical exercise—it’s a journey through one of the most transformative advances in deep learning history.

Understanding the Problem: Why Do Deep Networks Fail?

Before the advent of ResNet, deep learning researchers often struggled with a perplexing problem: as neural networks grew deeper, their performance began to plateau and even degrade—a phenomenon now known as the degradation problem. This wasn’t just about overfitting; surprisingly, adding more layers to a well-constructed neural network sometimes led to worse training and test error, contradicting intuitive expectations.

This failure has its roots in a few fundamental challenges:

Vanishing and Exploding Gradients: As information is propagated backward through a deep network during training, gradients can become extremely small (vanish) or grow uncontrollably large (explode). Vanishing gradients, in particular, make it difficult for earlier layers to learn, effectively making it harder for the overall network to converge to a good solution. If you’re interested in the mathematical underpinnings, consider reading Stanford’s CS231n lecture notes on this topic.
Network Optimization Difficulties: Deep networks introduce highly non-convex optimization landscapes. As the number of layers increases, the potential for getting stuck in suboptimal local minima or saddle points rises, making training more unpredictable. The Google AI Blog provides an accessible discussion on the challenges of training very deep networks.
Information Degradation: As you add more layers, it becomes harder for information (both features and gradients) to flow effectively through the network. The deeper the network, the more likely the meaningful features and learning signals are to become corrupted or lost altogether, reducing the ability to model complex functions.
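The vanishing and exploding behaviour can be seen with simple arithmetic. The sketch below is a toy model, not a real network: it assumes every layer scales the backpropagated gradient by the same constant factor, and `gradient_scale` is a hypothetical helper introduced only for this illustration.

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Gradient magnitude after flowing backward through `depth` layers,
    assuming each layer multiplies it by the same constant factor."""
    return per_layer_factor ** depth


for depth in (5, 20, 56):
    shrinking = gradient_scale(0.8, depth)  # factor below 1: gradient vanishes
    growing = gradient_scale(1.2, depth)    # factor above 1: gradient explodes
    print(f"depth={depth:3d}  factor 0.8 -> {shrinking:.2e}   factor 1.2 -> {growing:.2e}")
```

Even factors this close to 1 compound quickly with depth, which is why the earliest layers of a deep plain network can receive almost no learning signal.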

To illustrate the degradation problem, imagine you designed a 56-layer neural network for an image classification task. According to traditional wisdom, it should outperform a 20-layer network; after all, it has more representational power. However, empirical results showed that plain deep networks often performed worse as more layers were stacked, with both higher training error and higher test error. This was documented in the original ResNet paper by He et al., whose CIFAR-10 experiments showed a 56-layer plain (non-residual) network training worse than a 20-layer one.

Clearly, the issue wasn’t simply about more data or more layers. The core challenge was to design architectures where depth adds value, preserving the benefits of additional layers without succumbing to the degradation problem. This became the impetus for innovative solutions, culminating in the breakthrough that is ResNet. For a more detailed exploration of this, you might find Andrew Ng’s DeepLearning.AI newsletter insightful, where he discusses why deeper isn’t always better and what architectural advances turned the tide.

The Core Innovation: Residual Blocks Explained

At the heart of ResNet’s revolutionary impact is its core architectural unit: the residual block. This deceptively simple concept fundamentally changed how deep learning models are built and trained, addressing challenges that previously hindered progress as networks grew deeper.

Before ResNet, stacking more layers onto neural networks often ran into a perplexing problem: the degradation problem. As networks deepened, they didn’t automatically perform better. On the contrary, their accuracy started to degrade. Theoretically, a deeper network should always perform at least as well as its shallower counterpart, because the extra layers could simply learn the identity mapping and pass their input through unchanged. In practice, it often performed worse, even on the training set.

This is where residual blocks come in. The insight, first proposed by He et al. (2015), was to introduce shortcut connections—often referred to as skip connections—that allow the input of a layer to bypass one or more subsequent layers and be added directly to their output. Mathematically, instead of learning a direct mapping from input x to output H(x), residual blocks enable the network to learn the residual function, F(x) = H(x) - x. The original input x is then added back: H(x) = F(x) + x.

Step by step, a basic residual block works as follows:

Input Layer: The input, let’s call it x, is fed into the first convolutional layer of the block.
Transformation: The convolutional layers (typically two, sometimes with batch normalization and nonlinear activation functions like ReLU) process the input to produce an output, or F(x).
Shortcut Connection: The original input x bypasses these convolutional layers and is added directly to their output, forming F(x) + x.
Output Layer: The result, a sum of the transformed input with its original version, is passed further down the network.

This shortcut bypass ensures smoother backpropagation of gradients, effectively mitigating the vanishing gradient problem and allowing the network to train efficiently even with hundreds or thousands of layers. For an interactive visualization and more mathematical intuition, the Distill.pub article on Residual Networks offers a brilliant, in-depth guide.
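One way to see why the shortcut helps gradients is to differentiate through a single block. The derivation below is a sketch under a simplifying assumption (the final ReLU after the addition is ignored), and the symbol for the training loss, \mathcal{L}, is introduced here and does not appear in the original text.

```latex
H(x) = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial H}
    \left( \frac{\partial F}{\partial x} + I \right)
  = \frac{\partial \mathcal{L}}{\partial H}\,\frac{\partial F}{\partial x}
  + \frac{\partial \mathcal{L}}{\partial H}
```

The second term reaches x unchanged regardless of the weights inside F, so the gradient has a direct path back even when the first term is small.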

To see residual blocks in practice, here’s a simplified example with two layers:

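import torch.nn.functional as F

def basic_residual_block(x, w1, w2):
    # w1, w2: 3x3 conv weights of shape (C, C, 3, 3); padding=1 keeps H and W unchanged
    out = F.relu(F.conv2d(x, w1, padding=1))
    out = F.conv2d(out, w2, padding=1)  # no ReLU here: it is applied after the addition
    out = out + x  # This is the shortcut connection
    return F.relu(out)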

This radical architectural change enabled networks like ResNet-50, ResNet-101, and even deeper variants to dominate standard benchmarks like ImageNet and be deployed in a wide array of applications—from medical diagnosis to autonomous vehicles. Today, residual connections are a standard technique across deep learning, forming the backbone of not just vision models, but influencing architectures in natural language processing, speech recognition, and beyond.

Architectural Breakdown: Layers and Connections in ResNet

The architecture of ResNet represents a pivotal moment in deep learning, primarily due to its innovative handling of layers and connections. At its core, ResNet (short for Residual Network) proposes a simple yet powerful change: identity shortcut (skip) connections that let a block’s input bypass one or more layers and be added to their output. This architectural tweak paved the way for neural networks to grow dramatically deeper without suffering from the vanishing gradient problem that previously hindered deep models. Let’s take a closer look at the anatomy of ResNet and unravel its components.

Core Layer Types and Their Functions

Convolutional Layers: ResNet, like traditional neural networks for image tasks, starts with convolutional layers. These layers extract spatial features from input images, capturing textures, edges, and patterns. For implementation details, see the material on convolutional layers in Stanford’s CS231n notes.
Batch Normalization: Each convolutional block typically includes batch normalization, which stabilizes and accelerates the learning process by standardizing activations. This ensures robust performance even in very deep architectures. For further reading, see the original batch normalization paper on arXiv.
Activation Functions (ReLU): The rectified linear unit (ReLU) is applied after batch normalization, injecting non-linearity and helping to combat the vanishing gradient issue.

The Residual Block: Introducing Skip Connections

What truly differentiates ResNet from earlier architectures is the residual block. Within each block, the input (called the “identity”) skips ahead and is added to the output of a stack of convolutional layers. This is often referred to as the skip connection. The mathematical form can be simplified as:

y = F(x, {Wi}) + x

where F(x, {Wi}) is the residual mapping to be learned, and x is the input. This ingenious design means that the network only needs to learn the difference between the input and the target output, rather than the full transformation. It has a profound effect: even very deep networks (with 50, 101, or 152 layers!) can be trained effectively. For a deeper dive into why skip connections work, check out this detailed explanation from DeepLearning.AI.

Stacking Blocks: Building a Deep ResNet

ResNet is often implemented in configurations like ResNet-18, ResNet-34, ResNet-50, ResNet-101, and even ResNet-152. These numbers refer to the depth of the network, counted as the number of layers with learnable weights (convolutional and fully connected). The building process typically follows these steps:

Initial Convolution: The model begins with a single, large-kernel convolutional layer and subsampling to quickly reduce spatial resolution.
Series of Residual Blocks: Stages of 2, 3, 4, or more residual blocks are stacked, each increasing in feature depth. Variants like ResNet-50 use “bottleneck” blocks for efficiency—three layers per block rather than two.
Pooling & Classification Head: After the stack of residual blocks, global average pooling is applied, reducing the feature maps to a vector. This is followed by a fully connected layer for the final classification.
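The depth in each model name can be reproduced from the steps above with a little arithmetic. The sketch below uses the per-stage block counts of the standard ImageNet variants from He et al. (2015). The names `BLOCKS_PER_STAGE`, `CONVS_PER_BLOCK`, and `count_weighted_layers` are made up for this example, and 1×1 projection shortcuts are left out of the count, as is conventional.

```python
# Residual blocks in each of the four stages, per variant.
BLOCKS_PER_STAGE = {
    "ResNet-18": (2, 2, 2, 2),
    "ResNet-34": (3, 4, 6, 3),
    "ResNet-50": (3, 4, 6, 3),
    "ResNet-101": (3, 4, 23, 3),
    "ResNet-152": (3, 8, 36, 3),
}
# Basic blocks hold 2 conv layers; bottleneck blocks hold 3.
CONVS_PER_BLOCK = {
    "ResNet-18": 2, "ResNet-34": 2,
    "ResNet-50": 3, "ResNet-101": 3, "ResNet-152": 3,
}


def count_weighted_layers(name: str) -> int:
    """Initial conv + convs inside residual blocks + final fully connected layer."""
    return 1 + sum(BLOCKS_PER_STAGE[name]) * CONVS_PER_BLOCK[name] + 1


for name in BLOCKS_PER_STAGE:
    print(name, count_weighted_layers(name))
```

Note that ResNet-34 and ResNet-50 share the same (3, 4, 6, 3) layout; the extra depth of ResNet-50 comes entirely from swapping two-layer blocks for three-layer bottleneck blocks.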

Visualization and sample code structures for these steps can be found in guides like the PyTorch ResNet documentation.

Variations: Bottleneck Blocks and Downsampling

Deeper versions of ResNet (like ResNet-50 and beyond) utilize bottleneck residual blocks. A bottleneck block typically has three layers: a 1×1 convolution for dimensionality reduction, a 3×3 convolution for spatial filtering, and another 1×1 convolution for restoring dimensions. This structure economizes on parameters while maintaining expressiveness. More on these can be explored in this seminal paper from Microsoft Research.
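To make the parameter saving concrete, the sketch below counts convolution weights for a bottleneck block and for a pair of 3×3 convolutions at the same outer width. The widths (256 outer channels, 64 inner channels) are assumed for illustration, biases and batch normalization parameters are ignored, and `conv_params` is a hypothetical helper.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weight count of a 2D convolution without bias."""
    return in_channels * out_channels * kernel_size * kernel_size


channels, reduced = 256, 64  # widths assumed for this example

bottleneck = (
    conv_params(channels, reduced, 1)    # 1x1: reduce 256 -> 64 channels
    + conv_params(reduced, reduced, 3)   # 3x3: spatial filtering at 64 channels
    + conv_params(reduced, channels, 1)  # 1x1: restore 64 -> 256 channels
)
two_3x3 = 2 * conv_params(channels, channels, 3)  # two 3x3 convs at 256 channels

print(bottleneck, two_3x3, round(two_3x3 / bottleneck, 1))
```

Under these assumptions the bottleneck needs roughly one seventeenth of the weights, because the expensive 3×3 convolution runs on the reduced channel count.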

Downsampling & Dimension Matching

To allow the network to operate at multiple scales, ResNet uses either convolutions with stride greater than one or pooling layers for downsampling. When the dimensions of the skip connection and output don’t match, a projection shortcut (often a 1×1 convolution) aligns their dimensions, preserving the core idea of the residual connection.
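The effect of strided layers on feature-map size follows the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below traces a 224×224 input through a stem and four stages. The kernel sizes, strides, and paddings are the common ImageNet configuration and are assumptions here, as is the helper name `conv_output_size`. It also checks that a stride-2 1×1 projection shortcut produces the same spatial size as the main path.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = conv_output_size(224, kernel_size=7, stride=2, padding=3)   # initial 7x7 conv
stem_size = size
size = conv_output_size(size, kernel_size=3, stride=2, padding=1)  # 3x3 max pool
stage_sizes = [size]                                               # first stage keeps this size

for _ in range(3):  # each later stage opens with a stride-2 block
    main_path = conv_output_size(size, kernel_size=3, stride=2, padding=1)
    shortcut = conv_output_size(size, kernel_size=1, stride=2, padding=0)
    assert main_path == shortcut  # shapes must agree before the addition
    size = main_path
    stage_sizes.append(size)

print(stem_size, stage_sizes)
```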

An easy-to-follow breakdown of the architectural layers, including sample diagrams and further implementation details, is available at Towards Data Science.

Together, these architectural elements (layer types, residual connections, bottleneck blocks, and downsampling strategies) make ResNet a highly scalable and resilient framework. By allowing networks to go deeper without degradation in accuracy, ResNet has shaped everything from computer vision research to industry deployment of deep learning systems.

Implementing ResNet from Scratch: Key Steps and Components

Building your own ResNet implementation from scratch isn’t just a coding exercise—it’s a journey through some of the most game-changing ideas in deep learning. Here’s a detailed walkthrough of the core components and steps required to re-create this revolutionary architecture.

1. Understanding the Essence of Residual Learning

ResNet’s impact rests on the concept of residual learning. Traditional deep neural networks often suffer from vanishing gradients as they grow deeper, which makes training difficult. ResNet addresses this by introducing “skip connections” or “identity shortcuts” that let each block learn a residual function relative to its input, rather than an unreferenced mapping learned from scratch. This concept was first introduced in the original ResNet paper by Kaiming He et al. (2015).

2. The Core Building Block: Residual Blocks

At the heart of ResNet lies the residual block. Each block consists of two or three convolutional layers, with a shortcut that bypasses these layers:

Define the main path with sequential Conv → BatchNorm → ReLU → Conv → BatchNorm layers.
Add the identity shortcut by directly adding the block’s input to the output of the stacked layers.
If the input and output dimensions differ (due to stride or channel changes), use a projection shortcut (e.g., a 1×1 convolution) to match dimensions.

This basic structure powers the deeper ResNet variants, and can be implemented in code as a class, e.g., ResidualBlock in PyTorch or TensorFlow. Here’s a more detailed implementation sketch in PyTorch:

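from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Main path: Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm.
        # bias=False because each conv is followed by BatchNorm, which has its own shift.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity shortcut by default (an empty Sequential returns its input).
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: a 1x1 conv matches channels and spatial size.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # add the (possibly projected) input
        out = self.relu(out)
        return out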

For a thorough understanding, review Harvard’s introductory slides on ResNet.

3. Stacking Residual Blocks to Build the Network

The classic ResNet models—ResNet-18, ResNet-34, ResNet-50, and beyond—are built by stacking residual blocks in specific patterns. The arrangement for ResNet-34, for instance, looks like this:

Initial convolutional layer (with batch normalization and max pooling).
Four groups of residual blocks, each group increasing in channel number (64, 128, 256, 512) and halving the spatial resolution (except for the first group).
End with an average pooling layer and a fully connected output layer.

Each stage may have a block that changes the input-output dimension via projection shortcuts, while the rest use identity shortcuts. For detailed patterns, refer to this deep dive on ResNet architectures.
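The layout described above can be assembled into a complete model along the following lines. This is a minimal sketch rather than a reference implementation: the class names `BasicBlock` and `ResNet`, the stem hyperparameters (7×7 convolution, stride 2, followed by a 3×3 max pool), and the default of 1000 output classes are assumptions based on the common ImageNet setup.

```python
import torch
from torch import nn


class BasicBlock(nn.Module):
    """Two 3x3 convs plus a shortcut (identity, or 1x1 projection when shapes change)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))


class ResNet(nn.Module):
    def __init__(self, blocks_per_stage=(3, 4, 6, 3), num_classes=1000):
        super().__init__()
        # Initial convolution with batch normalization and max pooling.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        blocks, in_channels = [], 64
        for stage, (width, n_blocks) in enumerate(zip((64, 128, 256, 512), blocks_per_stage)):
            for index in range(n_blocks):
                # Only the first block of stages 2-4 halves the resolution.
                stride = 2 if (stage > 0 and index == 0) else 1
                blocks.append(BasicBlock(in_channels, width, stride))
                in_channels = width
        self.stages = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stages(self.stem(x))
        return self.fc(torch.flatten(self.pool(x), 1))


model = ResNet()  # the default block counts give the ResNet-34 layout
model.eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)
```

Passing `blocks_per_stage=(2, 2, 2, 2)` would give the ResNet-18 layout with the same code; ResNet-50 and deeper would additionally need a bottleneck block in place of `BasicBlock`.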

4. Implementing Training and Optimization Tricks

ResNet achieved unprecedented results on ImageNet by leveraging:

Batch Normalization: Stabilizes training and allows the use of higher learning rates. More details on batch norm can be found at Stanford’s CV tutorial.
Data Augmentation: Techniques like random cropping and horizontal flipping prevent overfitting.
He Initialization: Proper weight initialization is crucial with deep networks (read more at the original paper on initialization).

Adapt these strategies in your implementation for maximum performance, especially when working with very deep networks.
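As one example of these tricks, He initialization can be applied to every convolution with a small helper. The sketch below uses PyTorch’s `nn.init.kaiming_normal_`. The helper name `init_he` and the choice of `mode="fan_out"` are assumptions made for this example (`mode="fan_in"` is the other common option). It then compares the measured weight standard deviation with the expected value sqrt(2 / fan).

```python
import math

import torch
from torch import nn


def init_he(module):
    """He (Kaiming) normal init for conv layers; BatchNorm reset to scale 1, shift 0."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)


torch.manual_seed(0)
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
layer.apply(init_he)  # on a full model, model.apply(init_he) visits every submodule

fan_out = 128 * 3 * 3  # out_channels * kernel_height * kernel_width
expected_std = math.sqrt(2.0 / fan_out)
measured_std = layer.weight.std().item()
print(f"expected std {expected_std:.4f}, measured {measured_std:.4f}")
```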

5. Validating Your Implementation

Testing your ResNet from scratch involves:

Overfitting on a Small Dataset: Confirm the model can overfit a tiny dataset, ensuring that forward and backward passes work as expected.
Visualizing Activations: Use visualization libraries to inspect intermediate outputs for debugging.
Benchmarking Against Reference Implementations: Compare performance and accuracy with official open-source models, such as those in torchvision or TensorFlow Keras Applications.
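The first check in the list, overfitting a tiny dataset, might look like the sketch below. To keep it fast, it uses a small stand-in network and random data, which is an assumption of this example; in practice you would swap in your own ResNet and a handful of real samples. A model whose forward and backward passes are wired correctly should drive the loss on these few samples steadily down.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model, kept tiny so the check runs in seconds. Replace with your ResNet.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 4),
)
images = torch.randn(8, 3, 16, 16)  # 8 random "images"
labels = torch.randint(0, 4, (8,))  # 8 random labels from 4 classes

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the loss refuses to fall on a batch this small, the problem is usually in the model or the training loop rather than in the data.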

Through this methodical process, you’ll gain not only a powerful model but also insight into why ResNet is an enduring pillar in deep learning. For a comprehensive step-by-step implementation with code, diagrams, and best practices, check out the excellent resources at Stanford’s CS231n.

Training and Evaluating Your Custom ResNet Model

After architecting your custom ResNet, the next major milestones are training the model and evaluating its performance. This process is not only crucial for verifying whether your implementation works as expected, but also for uncovering the strengths and limitations of your neural architecture. Let’s break down the key phases involved in this transformative journey, emphasizing best practices and offering detailed, actionable steps.

Preparing Your Dataset

Training any deep learning model begins with data. For ResNet, standard benchmarks such as the large-scale ImageNet or the much smaller CIFAR-10 are popular choices, thanks to their rich variety and well-established baselines. Ensure your dataset is well-organized, with clear directory structures for training, validation, and test sets.

Data Augmentation: Techniques like random cropping, horizontal flipping, and normalization improve generalization. Libraries like torchvision.transforms in PyTorch or tf.image in TensorFlow can automate these tasks.
Preprocessing: Scale your images to a consistent size, typically 224×224 pixels for ImageNet-style models such as ResNet-50. Normalize each color channel by subtracting the dataset’s mean and dividing by its standard deviation for stability.
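The augmentation and normalization steps can be sketched with plain tensor operations, as below. Libraries such as torchvision.transforms provide ready-made versions. The helper names (`channel_stats`, `random_crop_flip`, `normalize`), the 4-pixel padding, and the random stand-in data are all assumptions made for this illustration.

```python
import torch


def channel_stats(images):
    """Per-channel mean and std over a batch of shape (N, C, H, W)."""
    return images.mean(dim=(0, 2, 3)), images.std(dim=(0, 2, 3))


def random_crop_flip(image, crop_size, pad=4):
    """Zero-pad a (C, H, W) image, take a random square crop, flip half the time."""
    padded = torch.nn.functional.pad(image, (pad, pad, pad, pad))
    _, height, width = padded.shape
    top = int(torch.randint(0, height - crop_size + 1, (1,)))
    left = int(torch.randint(0, width - crop_size + 1, (1,)))
    crop = padded[:, top:top + crop_size, left:left + crop_size]
    if torch.rand(1).item() < 0.5:
        crop = crop.flip(-1)  # horizontal flip: reverse the width axis
    return crop


def normalize(image, mean, std):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return (image - mean.view(-1, 1, 1)) / std.view(-1, 1, 1)


torch.manual_seed(0)
dataset = torch.rand(16, 3, 32, 32)  # stand-in for real images scaled to [0, 1]
mean, std = channel_stats(dataset)   # computed once, on the training set only
sample = normalize(random_crop_flip(dataset[0], crop_size=32), mean, std)
print(sample.shape)
```

Augmentation is applied only during training; validation and test images get the same normalization but no random cropping or flipping.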

Configuring the Training Loop

A robust training loop is essential for effective learning. Here’s what you should consider:

Optimizer Selection: Stochastic Gradient Descent (SGD) with momentum is the gold standard for ResNet, though adaptive optimizers like Adam can be used for faster prototyping.
Learning Rate Scheduling: Employ step decay, cosine annealing, or cyclical learning rates to navigate the complex loss landscape. See the original ResNet paper for recommendations on learning schedules.
Regularization: Weight decay is the main explicit regularizer for ResNets, and batch normalization adds a mild regularizing effect of its own. Dropout can help on small datasets, although the original ResNet did not use it.

Example (in PyTorch):

import torch
# `model` is the ResNet instance you built earlier.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
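To see what this schedule does over a run, the step-decay rule can be written out by hand. The function below is a plain-Python sketch of the formula, not PyTorch’s implementation; `step_lr` is a hypothetical name, and it assumes the scheduler is stepped once per epoch.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 30, gamma: float = 0.1) -> float:
    """Learning rate under step decay: multiply by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 89):
    print(f"epoch {epoch:2d}: lr = {step_lr(0.1, epoch):.4f}")
```

With the settings above, the learning rate stays at 0.1 for epochs 0 to 29, drops to 0.01 for epochs 30 to 59, and to 0.001 from epoch 60.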

Monitoring Performance and Avoiding Pitfalls

As you train, track both accuracy and loss on your training and validation sets. Plotting these metrics, as recommended by Andrew Ng in his popular Deep Learning Specialization, helps you visualize overfitting or underfitting trends early.

Early Stopping: If validation performance plateaus or degrades, consider halting training to avoid overfitting.
Checkpoints: Regularly save model checkpoints. This practice safeguards your progress and allows rollback in case of experimentation mishaps.
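Early stopping is usually implemented as a small counter around the validation loss. The sketch below shows one possible version. The class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0  # improvement: a natural point to save a checkpoint
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68]  # illustrative values only
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In this run the best loss (0.65) occurs at epoch 2 and the following three epochs fail to beat it, so training stops at epoch 5. The checkpoint saved at epoch 2 is the one to keep.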

Evaluating the Custom ResNet

Evaluation involves more than just accuracy. Calculate and analyze:

Confusion Matrix: Understand where your model misclassifies by plotting a confusion matrix. This can be done easily using scikit-learn’s confusion_matrix function.
Precision, Recall, and F1-Score: These metrics, also available through scikit-learn’s metrics, offer a nuanced assessment beyond accuracy, especially for imbalanced datasets.
Visual Inspection: Visualize activations and misclassified samples to develop an intuition for what your ResNet has learned. This technique, endorsed by researchers at Distill, can reveal both strengths and weaknesses.
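To show what these metrics compute, here is a small pure-Python sketch. In practice scikit-learn’s functions do the same job. The function names and the label lists are made up for this example, and the matrix follows the convention of true classes in rows and predicted classes in columns.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts samples whose true class is i and predicted class is j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_metrics(matrix, cls):
    """Precision, recall, and F1 for one class, read off the confusion matrix."""
    tp = matrix[cls][cls]
    fp = sum(row[cls] for row in matrix) - tp  # predicted as cls, but another class
    fn = sum(matrix[cls]) - tp                 # truly cls, but predicted otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0, 0, 0, 1, 1, 2, 2, 2]  # illustrative labels only
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(per_class_metrics(cm, 0))
```

Off-diagonal entries show which classes get confused with which: here one class-0 sample is predicted as class 1 and one class-2 sample as class 0.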

Iterative Improvements

Training doesn’t stop after a single run. Analyze your results to diagnose bottlenecks:

If overfitting: Add regularization, increase data augmentation, or collect more data.
If underfitting: Deepen your ResNet or adjust learning rates and optimizer hyperparameters.

Remember, deep learning success hinges on cycles of training, evaluation, and refinement, a process championed by top researchers at Carnegie Mellon University. Taking an experimental, empirical approach will make your foray into ResNet architecture both rigorous and rewarding.