Tech Pioneers

Kaiming He — Creator of ResNet and Pioneer of Modern Deep Learning

In 2015, a research paper from Microsoft Research introduced a neural network architecture so elegantly simple that many initially questioned whether it could really work. The idea was almost counterintuitive: instead of forcing each layer of a deep neural network to learn entirely new representations, why not let layers learn only the difference from what came before? That paper was “Deep Residual Learning for Image Recognition,” and its lead author was Kaiming He. That same year, ResNet won the ImageNet Large Scale Visual Recognition Challenge by a staggering margin, and within a few years its core principle — the skip connection — had become one of the most universally adopted ideas in all of deep learning. Today, virtually every state-of-the-art vision model, language model, and generative AI system owes a structural debt to He’s breakthrough. His story is one of quiet brilliance, rigorous experimentation, and an instinct for solutions that are simultaneously profound and practical.

Early Life and Education

Kaiming He was born in 1984 in Guangdong province, China. From a young age, he showed aptitude in mathematics and science, interests that would lead him to Tsinghua University, one of China’s most prestigious institutions. He enrolled in the Department of Automation, where he studied pattern recognition and intelligent systems — a field that combined signal processing, control theory, and early machine learning techniques. Tsinghua’s rigorous engineering curriculum gave him a strong foundation in both theoretical mathematics and practical system design.

After completing his undergraduate studies, He moved to the Chinese University of Hong Kong (CUHK) for graduate study, where he began working more directly on computer vision problems. His graduate advisor introduced him to the intersection of image processing and statistical learning, a field that was on the cusp of a revolution. During this period He began publishing papers on image dehazing and enhancement, demonstrating an early talent for finding elegant mathematical formulations for complex visual problems. His work on single-image haze removal using a dark channel prior would later become one of the most cited papers in image processing, winning the Best Paper Award at CVPR 2009 — a remarkable achievement for a graduate student.

He then pursued his PhD at CUHK under the supervision of Xiaoou Tang, deepening his expertise in visual recognition. By the time he completed his doctorate, the deep learning revolution was already gathering momentum. Geoffrey Hinton and his students had demonstrated the power of deep neural networks, and Alex Krizhevsky had shaken the computer vision world with AlexNet’s dominant performance at the 2012 ImageNet competition. He recognized that the field was shifting beneath his feet, and he was determined to be at the forefront.

The ResNet Breakthrough

Technical Innovation

When Kaiming He joined Microsoft Research Asia (MSRA) in Beijing, the deep learning community was grappling with a frustrating paradox. In theory, deeper neural networks should perform better — more layers mean more capacity to learn complex representations. In practice, however, networks beyond a certain depth actually performed worse than their shallower counterparts. This was not merely an overfitting problem; even on training data, deeper networks showed higher error rates. The phenomenon was called the degradation problem, and it was one of the most significant obstacles to progress in deep learning.

He and his collaborators — Xiangyu Zhang, Shaoqing Ren, and Jian Sun — proposed a solution of startling simplicity. Instead of asking each block of layers to learn the full desired mapping directly, they restructured the network so that each block only needed to learn the residual — the difference between the input and the desired output. Concretely, they added shortcut connections (also called skip connections) that bypassed one or more layers, allowing the input to be added directly to the output of the block. If the ideal function for a block was H(x), then the layers only needed to learn F(x) = H(x) – x, and the output would be F(x) + x.

This was the residual learning framework, and it transformed what the network needed to accomplish. If the optimal behavior for a particular block was to act as an identity mapping — to pass the input through unchanged — then the layers only needed to drive their weights to zero, a much easier optimization target than learning a full identity mapping from scratch. The skip connections created gradient highways that allowed error signals to flow directly through the network during backpropagation, dramatically easing the training of very deep architectures.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block as introduced in Kaiming He's ResNet.
    The skip connection adds the input directly to the learned residual."""

    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut connection for dimension matching
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity  # The core residual connection
        out = self.relu(out)
        return out
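
One way to see the “gradient highway” argument concretely: for y = F(x) + x, the derivative with respect to x is F′(x) + 1, so the identity term carries gradient straight through the block even when F’s own gradient is tiny. A minimal, illustrative PyTorch check (the scalar residual branch here is a deliberately weak stand-in for F):

```python
import torch

# Sketch: y = F(x) + x, where F(x) = w * x with a very small w.
x = torch.ones(4, requires_grad=True)
w = torch.full((4,), 1e-3)   # "weak" residual branch
y = (w * x + x).sum()        # residual output plus the skip connection
y.backward()
print(x.grad)                # w + 1 = 1.001 everywhere: the skip path dominates
```

Without the `+ x` term, the gradient reaching `x` would be only `w`, three orders of magnitude smaller — the same effect, compounded over a hundred layers, is what made very deep plain networks untrainable.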

The results were historic. The team trained networks with 152 layers — over eight times deeper than VGGNet, which had been considered deep at 19 layers — and achieved a top-5 error rate of 3.57% on ImageNet 2015, surpassing human-level performance on this benchmark for the first time. ResNet won first place in all five tracks of the ImageNet and MS COCO competitions that year. The paper itself went on to become one of the most cited works in all of science, accumulating over 200,000 citations.

Why It Mattered

ResNet did not merely set a new record on a benchmark. It fundamentally changed how the deep learning community thought about network architecture. Before ResNet, there was a practical ceiling on network depth. After ResNet, the question shifted from “how deep can we go” to “how should we structure depth.” The skip connection became a design primitive — as fundamental to neural architecture as the convolutional layer or the activation function.

The impact radiated far beyond computer vision. Transformer architectures, which power modern language models like GPT and BERT, use residual connections in every layer. Without the skip connection, training models with hundreds of billions of parameters would be essentially impossible. The work of researchers like Ashish Vaswani, who designed the original Transformer, explicitly built upon the residual learning framework. In this sense, He’s contribution is baked into the DNA of modern AI.
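
To make that lineage concrete, here is a minimal pre-norm Transformer encoder block in PyTorch. The dimensions and the pre-norm layout are illustrative choices, not the configuration of any particular model; the point is the two `x + ...` additions, which are residual connections in exactly ResNet’s sense:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative pre-norm Transformer encoder block (a sketch):
    note the two residual additions inherited from ResNet."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x
```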

Other Major Contributions

While ResNet remains his signature achievement, Kaiming He’s body of work extends well beyond a single architecture. His contributions span initialization theory, object detection, instance segmentation, and self-supervised learning — each one representing a significant advance in its own right.

He Initialization (2015): Before ResNet, He published a paper on weight initialization for rectified linear units (ReLU-based networks) that became a standard practice across the field. The “He initialization” (also called Kaiming initialization) sets initial weights by sampling from a distribution scaled by the number of input connections, specifically accounting for the asymmetry of ReLU activations. This seemingly small detail made a major difference in training stability, particularly for very deep networks. Nearly every modern deep learning framework implements He initialization as a default option.

import torch.nn as nn
import math

def he_initialization(layer):
    """Kaiming He initialization for layers with ReLU activation.
    Maintains variance of activations across layers to prevent
    vanishing or exploding gradients in deep networks."""
    if isinstance(layer, (nn.Conv2d, nn.Linear)):
        fan_in = layer.weight.data.size(1)
        if isinstance(layer, nn.Conv2d):
            fan_in *= layer.weight.data.size(2) * layer.weight.data.size(3)
        std = math.sqrt(2.0 / fan_in)
        nn.init.normal_(layer.weight, mean=0, std=std)
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

# Apply to a model (pool and flatten so the Linear layer's input lines up)
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 10)
)
model.apply(he_initialization)
# PyTorch also provides this scheme built in as nn.init.kaiming_normal_

Faster R-CNN and Region Proposal Networks (2015): Together with Shaoqing Ren and Jian Sun, He co-authored Faster R-CNN, which introduced Region Proposal Networks (RPNs) that unified object proposal generation and detection into a single trainable network. This work eliminated the computational bottleneck of selective search, making real-time object detection practical. Faster R-CNN became the foundation for most subsequent object detection research.
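
The core of the RPN idea can be sketched as a tiny convolutional head that slides over a backbone feature map, predicting, for each of several anchor boxes at every spatial position, an objectness score and four box-regression offsets. The channel and anchor counts below are illustrative defaults, not the exact configuration of any published Faster R-CNN variant:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Simplified sketch of a Region Proposal Network head: per anchor
    and per spatial location, one objectness score and 4 box deltas."""

    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # object vs. not
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box offsets

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```

Because proposals come from plain convolutions over shared features, the expensive external proposal step (selective search) disappears, which is exactly what made the pipeline end-to-end trainable and fast.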

Mask R-CNN (2017): He extended Faster R-CNN to perform instance segmentation — the task of not only detecting objects but also precisely delineating their boundaries at the pixel level. Mask R-CNN added a simple parallel branch for predicting segmentation masks alongside the existing bounding box and classification outputs. Its clean, modular design made it immediately practical, and it quickly became the standard approach for instance segmentation tasks in both research and industry.
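
The “simple parallel branch” can be sketched as a small fully convolutional head applied to each pooled region-of-interest feature; the sizes below (14×14 RoI features, 80 classes, one upsampling step to 28×28 masks) follow common Mask R-CNN configurations but are illustrative here:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Sketch of Mask R-CNN's parallel mask branch: a small FCN over
    each RoI feature, predicting a per-class binary mask."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU())
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.predict = nn.Conv2d(256, num_classes, 1)  # one mask per class

    def forward(self, roi_feat):               # roi_feat: (N, 256, 14, 14)
        h = torch.relu(self.upsample(self.convs(roi_feat)))
        return self.predict(h)                 # (N, num_classes, 28, 28)
```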

Feature Pyramid Networks (2017): Recognizing that objects in images appear at vastly different scales, He co-developed Feature Pyramid Networks (FPN), which built multi-scale feature maps by combining high-resolution, semantically weak features from early layers with low-resolution, semantically strong features from deeper layers. FPN became a standard component in modern object detection and segmentation pipelines.
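
One top-down merge step of that pyramid can be sketched in a few lines: upsample the semantically strong deep map and add a 1×1-projected (“lateral”) version of the shallower, higher-resolution map. The channel sizes here are illustrative; real FPNs project every level to a common width such as 256 channels:

```python
import torch
import torch.nn as nn

def fpn_merge(deep, shallow, lateral_conv):
    """One top-down FPN merge step (a sketch): upsample the deep map
    and add the laterally-projected shallow map."""
    top_down = nn.functional.interpolate(deep, scale_factor=2, mode="nearest")
    return top_down + lateral_conv(shallow)

lateral = nn.Conv2d(512, 256, kernel_size=1)   # project shallow features
deep = torch.randn(1, 256, 8, 8)      # low-res, semantically strong
shallow = torch.randn(1, 512, 16, 16) # high-res, semantically weak
merged = fpn_merge(deep, shallow, lateral)
print(merged.shape)  # torch.Size([1, 256, 16, 16])
```

Repeating this step down the backbone yields a pyramid of maps that are all semantically strong, one per scale, which is why a single detector head can then handle objects of very different sizes.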

Masked Autoencoders (MAE, 2022): In his more recent work at Meta AI Research (FAIR), He introduced Masked Autoencoders, a self-supervised learning approach for computer vision inspired by the success of masked language modeling in NLP. By randomly masking large portions (75%) of an image and training a Vision Transformer to reconstruct the missing patches, MAE demonstrated that visual representations could be learned effectively without labeled data. This work opened new directions for scalable, data-efficient visual learning, connecting to ideas explored by researchers like Yann LeCun in self-supervised paradigms.
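
The masking step at the heart of MAE is easy to sketch: per sample, randomly permute the patch tokens and keep only the first 25%, so the encoder never even sees the masked patches. The patch count and embedding width below follow a ViT-Base-style layout but are illustrative:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Sketch of MAE-style random masking: keep a random subset of
    patch tokens per sample (75% are masked out, as in the paper)."""
    n, l, d = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(n, l)            # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)  # a random permutation of patches
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(patches, 1,
                        ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_keep

imgs_as_patches = torch.randn(2, 196, 768)  # 14x14 patches per image
visible, ids = random_masking(imgs_as_patches)
print(visible.shape)  # torch.Size([2, 49, 768])
```

Because the encoder processes only the visible quarter of the tokens, pretraining is also substantially cheaper than running a ViT over full images.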

Philosophy and Approach

Kaiming He is known for a research philosophy that prizes elegance, empirical rigor, and practical impact. Unlike researchers who chase complexity, He consistently finds solutions that are simpler than what came before — yet more powerful.

Key Principles

  • Simplicity as a design goal: He has repeatedly stated that the best ideas in deep learning are the simplest ones. ResNet’s skip connection, He initialization’s scaling factor, and Mask R-CNN’s parallel branch all share a common trait: they are easy to explain, easy to implement, and easy to build upon. This stands in contrast to architectures that achieve marginal gains through elaborate engineering.
  • Let the math guide the intuition: He’s work is rooted in careful mathematical analysis. His initialization scheme was derived from a formal analysis of signal propagation through ReLU networks. His residual learning framework was motivated by a mathematical argument about the ease of learning identity mappings. He does not propose architectures by trial and error; he reasons about why something should work before building it.
  • Empirical validation at scale: Every major paper from He includes exhaustive experiments. He does not publish an idea with a single supporting result. The ResNet paper included experiments ranging from 18 to 1,202 layers across multiple datasets and tasks. This thoroughness has given his work an unusual level of credibility and reproducibility.
  • Modularity and composability: He designs components that can be plugged into existing systems. Residual blocks, FPN layers, and Mask R-CNN heads are all modular — they can be adopted incrementally without redesigning an entire pipeline. This design philosophy has been essential to their widespread adoption.
  • Bridging theory and practice: He moves fluidly between theoretical insights and large-scale engineering. He is equally comfortable deriving variance-preserving initialization schemes and training 152-layer networks on clusters of GPUs. This dual expertise is rare and accounts for much of his impact, a quality shared by pioneers like Ilya Sutskever.

Legacy and Impact

Kaiming He’s contributions have reshaped the landscape of artificial intelligence in ways that are difficult to overstate. ResNet is not just a model — it is a paradigm. The residual connection has become as fundamental to neural network design as backpropagation itself. It appears in virtually every modern architecture: Vision Transformers, diffusion models for image generation, large language models, speech recognition systems, and protein structure prediction tools like AlphaFold (developed by Demis Hassabis and his team at DeepMind).

He’s career trajectory reflects the growing centrality of AI research. After his years at Microsoft Research Asia, he moved to Facebook AI Research (FAIR, now Meta AI), where he continued to push the boundaries of visual understanding. In 2024, he joined the faculty at MIT, bringing his expertise to the next generation of researchers — a move that signaled the importance of combining industrial-scale research with academic mentorship.

His citation numbers tell part of the story — the ResNet paper alone has been cited hundreds of thousands of times, making it one of the most cited scientific papers ever published. But the deeper impact lies in how his ideas have changed what is possible. Before ResNet, training a 100-layer network was a research challenge. After ResNet, researchers routinely train networks with thousands of layers and billions of parameters. The deep learning revolution that has produced modern AI assistants, autonomous vehicles, medical imaging systems, and creative tools all builds on the foundations He helped lay.

The deep learning pioneers of He’s generation — including Yoshua Bengio, Hinton, and LeCun — laid the theoretical groundwork. He’s specific genius was in translating that groundwork into architectures that actually worked at scale, reliably and reproducibly. In doing so, he helped bridge the gap between deep learning as a promising research direction and deep learning as the dominant paradigm in artificial intelligence.

Key Facts

  • Full name: Kaiming He (何恺明)
  • Born: 1984, Guangdong, China
  • Education: BS from Tsinghua University; MS and PhD from the Chinese University of Hong Kong
  • Known for: ResNet (Residual Networks), He initialization, Faster R-CNN, Mask R-CNN, Feature Pyramid Networks, Masked Autoencoders
  • Key positions: Microsoft Research Asia, Facebook AI Research (FAIR/Meta AI), MIT (since 2024)
  • Awards: CVPR Best Paper Award (2009, 2016), Marr Prize at ICCV (2017), PAMI Young Researcher Award
  • ResNet paper citations: Over 200,000 — one of the most cited scientific papers in history
  • ImageNet 2015: Won all five competition tracks with ResNet, achieving 3.57% top-5 error (surpassing human-level performance)
  • Research philosophy: Prioritizes simplicity, mathematical rigor, and modular design

Frequently Asked Questions

What is ResNet and why is it so important?

ResNet (Residual Network) is a deep neural network architecture introduced by Kaiming He and colleagues in 2015. Its key innovation is the skip connection (or residual connection), which allows the input to a block of layers to be added directly to the block’s output. This seemingly small change solved the degradation problem — the paradox that deeper networks performed worse than shallower ones — by making it easy for layers to learn identity mappings when no further transformation is needed. ResNet enabled training of networks with hundreds or even thousands of layers, and its architectural principle has been adopted across virtually all of modern deep learning, from computer vision to natural language processing to generative AI.

What is He initialization and why does it matter?

He initialization (also called Kaiming initialization) is a weight initialization strategy designed specifically for neural networks that use ReLU (Rectified Linear Unit) activation functions. Before this method, networks often suffered from vanishing or exploding gradients during early training, especially as they grew deeper. He’s approach sets initial weights by drawing from a Gaussian distribution scaled by the square root of 2 divided by the number of input units, which accounts for the fact that ReLU zeros out negative values. This keeps the variance of activations stable across layers, allowing deep networks to begin training effectively from the start. It is now a default initialization method in frameworks like PyTorch and TensorFlow.
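
In PyTorch, for instance, the scheme is exposed directly as `nn.init.kaiming_normal_` (the layer below is just an example):

```python
import torch.nn as nn

layer = nn.Conv2d(3, 64, kernel_size=3, padding=1)
# fan_in = 3 * 3 * 3 = 27, so weights are drawn with std = sqrt(2 / 27)
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
```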

How did Kaiming He’s work influence the Transformer architecture?

The Transformer architecture, introduced by Ashish Vaswani and colleagues in 2017, uses residual connections in every single layer — both in the self-attention blocks and in the feed-forward blocks. Without these skip connections, training Transformers with many layers would be extremely difficult due to gradient degradation. Since Transformers are the foundation of modern large language models (GPT, Claude, LLaMA) as well as Vision Transformers, He’s residual learning framework is a critical enabling component of the current AI revolution. The connection between ResNet and Transformers illustrates how fundamental architectural insights in one domain can transform entirely different fields.

What is Kaiming He working on now?

As of his move to MIT in 2024, Kaiming He continues to work at the intersection of computer vision and self-supervised learning. His recent work on Masked Autoencoders (MAE) explores how visual representations can be learned without large labeled datasets — a direction that could make AI systems more data-efficient and broadly applicable. At MIT, he is also mentoring the next generation of AI researchers while maintaining connections to large-scale industrial research. His ongoing work reflects a consistent theme throughout his career: finding simple, principled approaches to the hardest problems in visual understanding and representation learning.