In June 2017, a paper appeared on arXiv with a title that sounded almost deliberately understated: “Attention Is All You Need.” Eight Google researchers were listed as authors, but one name — Noam Shazeer — had been driving toward the core ideas in that paper for years, long before anyone understood how completely the Transformer architecture would reshape artificial intelligence. Shazeer was not the first author, nor did he seek the spotlight. He was, by temperament and by practice, an engineer who preferred building things that worked to building a personal brand. But his fingerprints were on the most consequential technical decisions in the paper: the scaled dot-product attention, the multi-head attention mechanism, and the particular way the architecture dispensed entirely with recurrence and convolution. Only months earlier he had published the sparsely-gated Mixture of Experts approach for scaling neural networks, and he had been a principal architect of Google’s ad-serving system — the engine that generated the majority of Alphabet’s revenue. After co-creating the architecture that would power GPT, BERT, PaLM, and virtually every large language model in existence, Shazeer left Google in 2021 to co-found Character.AI, a startup that explored the conversational potential of large language models. Then, in a move that surprised the industry, he returned to Google in 2024 in a deal reportedly worth approximately $2.7 billion, bringing his expertise back to the Gemini project. Noam Shazeer’s career is a study in how a single engineer’s intuition and persistence can redirect the entire trajectory of a field — and why the most important contributions in technology often come from people who care more about the work than about the credit.
Early Life and Education
Noam Shazeer grew up in a family that valued intellectual curiosity and mathematical thinking. He displayed an early aptitude for mathematics and computer science that led him to Duke University, where he earned his Bachelor of Science degree in mathematics. At Duke, Shazeer distinguished himself not through flashy competition victories but through a quiet, persistent engagement with difficult problems — the kind of engagement that would characterize his entire career. He was drawn to the intersection of mathematics and computation, fascinated by how abstract mathematical structures could be made to do practical work when implemented in software.
After completing his undergraduate studies, Shazeer joined Google in 2000, when the company was still a promising young startup rather than the dominant force in internet technology it would become. He did not pursue a Ph.D. — a decision that set him apart from many of his later collaborators in machine learning research. This absence of formal graduate training in AI was, paradoxically, an asset. Shazeer approached machine learning problems as an engineer and mathematician rather than as a specialist in any particular subfield, which gave him the freedom to think across disciplinary boundaries and to propose architectures that a more conventionally trained researcher might have dismissed as implausible.
At Google, Shazeer quickly became one of the most productive and influential engineers in the company’s history. His early work focused on the systems that would become the backbone of Google’s advertising business — the infrastructure that matched ads to search queries and web pages with a precision and scale that no previous system had achieved. This was not glamorous AI research, but it was engineering of the highest order: the systems Shazeer helped build needed to process billions of queries per day, make real-time decisions under extreme latency constraints, and continuously learn from new data. The discipline of building systems that worked reliably at planetary scale would prove invaluable when Shazeer later turned his attention to language models and neural architecture design.
The Transformer and Mixture of Experts Breakthroughs
Technical Innovation
Shazeer’s most famous contribution is his role as a co-author of the 2017 paper that introduced the Transformer architecture. But to understand the depth of his contribution, it is necessary to understand what came before and why it was inadequate.
By the mid-2010s, recurrent neural networks (RNNs) — particularly Long Short-Term Memory (LSTM) networks — were the dominant architecture for sequence modeling tasks like machine translation, language modeling, and speech recognition. RNNs processed sequences one token at a time, maintaining a hidden state that theoretically captured information about all previous tokens. In practice, this sequential processing created two fundamental problems. First, it was inherently slow: because each step depended on the output of the previous step, RNNs could not be parallelized across the sequence dimension, which meant that training on long sequences was computationally expensive. Second, despite the gating mechanisms in LSTMs, information still degraded over long distances — the network’s ability to attend to a token at position 5 when generating a token at position 500 was limited.
The Transformer solved both problems simultaneously through a mechanism called self-attention. Instead of processing tokens sequentially, the Transformer computed attention scores between every pair of tokens in the sequence in parallel. Each token could directly attend to every other token, regardless of distance, through a learned weighted combination. Shazeer’s contribution to the specific form of this attention — scaled dot-product attention with multi-head decomposition — was central. The scaling factor (dividing by the square root of the key dimension) prevented the dot products from growing too large in high-dimensional spaces, which would have pushed the softmax function into regions of extremely small gradients. The multi-head design allowed the model to attend to information from different representation subspaces at different positions simultaneously — one head might capture syntactic relationships while another captured semantic ones.
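The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's production implementation: the weight matrices are random, there is no causal masking or batching, and the particular head-splitting layout (reshape then transpose) is just one common convention.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    The 1/sqrt(d_k) factor keeps dot products from growing with
    dimension, which would otherwise push the softmax into regions
    of vanishingly small gradients."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return weights @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Split d_model into num_heads subspaces, attend in each
    subspace in parallel, then concatenate and project back."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(W):  # project, then reshape to (num_heads, seq_len, d_head)
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    heads = scaled_dot_product_attention(Q, K, V)     # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 10, 8
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
assert out.shape == (seq_len, d_model)
```

Every step is a dense matrix multiplication over the whole sequence at once, which is why the computation parallelizes so well on GPUs and TPUs, in contrast to the token-by-token loop an RNN requires.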
But Shazeer’s innovations extended well beyond the Transformer paper. Earlier that same year, in January 2017, he had published “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” — a paper that introduced a practical approach to scaling neural networks to enormous sizes without proportionally increasing computational cost. The key insight was elegant: instead of routing every input through the entire network, a learned gating function could select a small subset of specialized “expert” sub-networks for each input. This meant that a model could have trillions of parameters while only activating a fraction of them for any given input, keeping the computational cost manageable.
```python
import numpy as np


class SimplifiedMixtureOfExperts:
    """
    Demonstrates Noam Shazeer's Mixture of Experts (MoE) concept.

    Core insight: instead of passing every input through a single
    massive network, use a gating mechanism to route each input
    to a small subset of specialized 'expert' networks.

    Result: model capacity can scale to trillions of parameters
    while keeping per-input compute cost roughly constant.
    """

    def __init__(self, input_dim, expert_dim, num_experts, top_k=2):
        self.num_experts = num_experts
        self.top_k = top_k  # only activate top-k experts per input
        # Each expert is a small feed-forward network
        self.expert_weights = [
            np.random.randn(input_dim, expert_dim) * 0.02
            for _ in range(num_experts)
        ]
        # Gating network: learns which experts to activate
        self.gate_weights = np.random.randn(input_dim, num_experts) * 0.02

    def gating(self, x):
        """
        Shazeer's sparsely-gated routing mechanism.

        Computes a probability distribution over experts,
        then selects only the top-k for computation.
        """
        logits = x @ self.gate_weights  # (batch, num_experts)
        # Add tunable noise for load balancing during training
        noise = np.random.randn(*logits.shape) * 0.1
        noisy_logits = logits + noise
        # Select top-k experts per input
        top_k_indices = np.argsort(noisy_logits, axis=-1)[:, -self.top_k:]
        # Softmax only over selected experts (sparse routing)
        mask = np.zeros_like(logits)
        for i in range(len(x)):
            mask[i, top_k_indices[i]] = 1.0
        masked_logits = logits * mask + (1 - mask) * (-1e9)
        exp_logits = np.exp(masked_logits - masked_logits.max(axis=-1, keepdims=True))
        gates = exp_logits / exp_logits.sum(axis=-1, keepdims=True)
        gates = gates * mask  # zero out non-selected experts
        return gates, top_k_indices

    def forward(self, x):
        """
        Forward pass: route each input through its top-k experts.

        Total model parameters: num_experts * expert_params
        Compute per input:      top_k * expert_params

        With 64 experts and top_k=2, only ~3% of expert parameters
        are activated per input — Shazeer's key scaling insight.
        """
        gates, _ = self.gating(x)  # (batch, num_experts)
        batch_size = x.shape[0]
        output = np.zeros((batch_size, self.expert_weights[0].shape[1]))
        for i in range(self.num_experts):
            gate_values = gates[:, i:i + 1]  # (batch, 1)
            if np.any(gate_values > 1e-6):  # skip experts no input was routed to
                expert_output = np.maximum(0, x @ self.expert_weights[i])  # ReLU expert
                output += gate_values * expert_output
        return output


# Demonstrate MoE scaling properties
input_dim = 512
expert_dim = 1024
num_experts = 64
top_k = 2

moe = SimplifiedMixtureOfExperts(input_dim, expert_dim, num_experts, top_k)
x = np.random.randn(8, input_dim)  # batch of 8 inputs
result = moe.forward(x)

total_params = num_experts * input_dim * expert_dim
active_params = top_k * input_dim * expert_dim
efficiency = active_params / total_params
# 64 experts, but only 2 active per input = 3.1% of expert compute.
# This is how models like Switch Transformer and Gemini scale.
```
The Mixture of Experts paper was published the same year as the Transformer paper, and the two ideas were deeply complementary. The Transformer provided an architecture that could process sequences in parallel with full attention across all positions. The MoE approach provided a way to scale that architecture to enormous sizes without proportionally increasing the cost of each forward pass. The combination of these two ideas — Transformer architecture with MoE layers — would become the foundation of some of the largest and most capable language models ever built, including Google’s Switch Transformer and the Gemini model family that Shazeer would later help develop.
Why It Mattered
The Transformer architecture did not merely improve upon existing approaches to sequence modeling — it replaced them entirely. Within two years of the paper’s publication, virtually every state-of-the-art result in natural language processing was achieved using Transformer-based models. BERT, published by Google in 2018, applied the Transformer’s encoder to produce contextual word representations that shattered records on eleven NLP benchmarks simultaneously. GPT, developed by OpenAI, applied the Transformer’s decoder to autoregressive language modeling and demonstrated that scaling up the architecture produced increasingly capable systems. By 2020, the Transformer had expanded beyond language into computer vision (Vision Transformer), protein structure prediction (AlphaFold 2), music generation, robotics, and dozens of other domains.
The Mixture of Experts contribution mattered for a different but equally profound reason. The key challenge in building more capable AI systems was not just architectural — it was economic. Training a dense model with trillions of parameters would require computational resources that no organization could afford. Shazeer’s MoE approach broke the linear relationship between model size and computational cost, making it economically feasible to train models with far more parameters than any dense model. This insight directly influenced the design of subsequent Google models and became a central architectural choice in Gemini, the model family that Google positioned as its response to GPT-4 and other frontier models.
Together, the Transformer and MoE represent a one-two punch that defined the modern era of artificial intelligence. The Transformer showed how to model sequences without the bottleneck of sequential processing. The MoE showed how to scale models without the bottleneck of proportional compute costs. Both ideas originated from or were shaped by the same engineer — a fact that underscores the extraordinary breadth of Shazeer’s technical vision.
Other Major Contributions
Google’s Ad System and Search Infrastructure
Before his work on Transformers and language models, Shazeer spent years as one of the key engineers behind Google’s advertising infrastructure. The ad system that Shazeer helped build was not a simple keyword matcher — it was a sophisticated machine learning system that needed to predict click-through rates, estimate advertiser willingness to pay, optimize auction mechanisms, and do all of this in real-time for billions of daily queries. The revenue generated by this system — hundreds of billions of dollars over the years — funded the research that eventually produced the Transformer, AlphaGo, and virtually every other Google AI project. In a very real sense, Shazeer’s engineering work on ads created the economic foundation that made modern AI research at Google possible.
Meena, LaMDA, and Conversational AI
After the Transformer paper, Shazeer turned his attention to conversational AI. He contributed to Meena, a neural conversational model published in 2020 that achieved state-of-the-art results on a metric called Sensibleness and Specificity Average (SSA). Meena demonstrated that a sufficiently large Transformer-based model, trained on enough conversational data, could engage in open-ended dialogue that was remarkably coherent and contextually appropriate. This work evolved into LaMDA (Language Model for Dialogue Applications), which Google announced in 2021 and which became the subject of intense public attention when a Google engineer claimed it was sentient — a claim that Google and the broader AI research community firmly rejected, but one that demonstrated the extraordinary persuasiveness of the conversational systems Shazeer had helped create.
The experience with Meena and LaMDA shaped Shazeer’s conviction that conversational AI represented the most natural and powerful interface between humans and AI systems. This conviction would drive his next major career move.
Character.AI
In November 2021, Shazeer left Google to co-found Character.AI with Daniel De Freitas, another former Google researcher who had worked on LaMDA. The premise was ambitious: build a platform where users could create and interact with AI characters powered by large language models. Character.AI quickly attracted tens of millions of users, demonstrating massive consumer demand for conversational AI experiences. The company raised significant venture capital and reached a valuation that placed it among the most valuable AI startups in the world.
Character.AI was significant not just as a business but as a proof of concept. It showed that conversational AI was not just an academic research direction but a product category with genuine consumer demand. The platform’s user engagement metrics were extraordinary — active users reportedly spent on the order of two hours a day interacting with AI characters, engagement that rivaled the most habit-forming social media platforms.
Then, in August 2024, Google signed a deal to license Character.AI’s technology and bring both Shazeer and De Freitas back to Google. The deal was structured not as an acquisition but as a licensing agreement, reportedly to avoid regulatory scrutiny, with Google paying approximately $2.7 billion. Shazeer returned to Google DeepMind to work on Gemini — the very model family that his Transformer and MoE innovations had made possible. The return of the Transformer’s co-creator to work on one of its most advanced descendants was a fitting capstone to a career defined by the relentless pursuit of better architectures.
Philosophy and Approach
Key Principles
Noam Shazeer’s approach to research and engineering is distinctive in ways that help explain why his contributions have been so consistently impactful.
Simplicity as a design principle. The Transformer architecture is, at its core, remarkably simple. It consists of attention mechanisms, feed-forward networks, residual connections, and layer normalization — no recurrence, no convolution, no complex gating mechanisms like those in LSTMs. This simplicity was not accidental. Shazeer has consistently favored architectures that do one thing extremely well over architectures that combine many clever mechanisms. The scaled dot-product attention in the Transformer is little more than a pair of matrix multiplications and a softmax — elegant, parallelizable, and easy to reason about. This bias toward simplicity is what allowed the Transformer to scale so effectively: simpler architectures have fewer failure modes and fewer hyperparameters to tune.
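That short list of ingredients can be written out end to end. The sketch below is a minimal single-head encoder block with random weights, no dropout, and no masking; it is meant only to show how little machinery a Transformer block actually contains, not to reproduce the paper's exact layer ordering.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Attention sublayer: residual connection + layer norm
    x = layer_norm(x + attention(x, Wq, Wk, Wv))
    # Position-wise feed-forward sublayer: residual + layer norm
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)

rng = np.random.default_rng(1)
d, seq = 32, 6
x = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * 0.1   # FFN expands by 4x...
W2 = rng.standard_normal((4 * d, d)) * 0.1   # ...then projects back
y = transformer_block(x, Wq, Wk, Wv, W1, W2)
assert y.shape == (seq, d)
```

Four small functions cover the whole block: attention, a two-layer feed-forward network, residual additions, and layer normalization. Stacking this block and adding embeddings and positional information yields the full architecture.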
Scale as a research methodology. Shazeer was one of the first researchers to recognize that the relationship between model scale and capability was not just quantitative but qualitative. Larger models did not just perform better on existing tasks — they acquired entirely new capabilities that smaller models did not possess. This insight, which would later be formalized as “emergent abilities” in large language models, guided both his MoE work (which made extreme scale economically feasible) and his conversational AI work at Google and Character.AI. As GPUs and TPUs made massive parallel computation widely accessible, Shazeer’s architectures were uniquely positioned to exploit that computational abundance.
Engineering excellence as a prerequisite for research impact. Unlike many AI researchers who focus primarily on novel ideas, Shazeer has always insisted that ideas only matter if they can be implemented efficiently and made to work at scale. His background in Google’s ad system — where latency budgets were measured in milliseconds and reliability requirements were absolute — instilled a discipline that permeates all his research. The Transformer was not just a good idea; it was a good idea that could be implemented efficiently on TPUs, trained stably at scale, and deployed in production systems. This engineering sensibility is what separated the Transformer from earlier attention-based architectures that achieved similar theoretical results but could not be scaled.
Collaboration over competition. The “Attention Is All You Need” paper had eight authors, each contributing different essential components. Shazeer’s willingness to collaborate — to contribute his ideas to a shared project rather than pursuing sole authorship — was instrumental in the Transformer’s success. The same collaborative instinct characterized his work on MoE, where he built on ideas from decades of machine learning research, and his work on conversational AI, where he partnered with De Freitas to build Character.AI. In a field where individual recognition is intensely valued, Shazeer’s collaborative approach stands out.
Legacy and Impact
Noam Shazeer’s legacy is unusual in that it encompasses two of the most important architectural innovations in the history of artificial intelligence — the Transformer and the modern Mixture of Experts — while remaining relatively unknown to the general public. This asymmetry between impact and recognition is itself revealing. The Transformer architecture is the foundation of virtually every major AI system deployed today: GPT-4, Claude, Gemini, LLaMA, Mistral, and dozens of others. Every time someone uses a language model to write an email, translate a document, or generate code, they are using technology that Shazeer co-created. The Mixture of Experts approach has become the standard method for scaling language models beyond what dense architectures can achieve economically, and it is a central architectural choice in the most capable models being trained today.
Beyond his architectural contributions, Shazeer’s career demonstrates a particular kind of impact that is rare in technology: the impact of an engineer who operates at the intersection of theoretical insight and practical implementation. He did not merely propose the Transformer as a theoretical construct — he helped build, train, and validate it on real-world tasks. He did not merely describe MoE as a scaling strategy — he implemented the sparsely-gated layer, solved the load-balancing problems, and demonstrated that it worked in practice. This combination of conceptual depth and engineering execution is what made his contributions so durable.
The conversational AI systems he helped create at Google and Character.AI demonstrated that large language models could engage in open-ended dialogue with a naturalness that previous approaches could not achieve. This work directly influenced the design of every major conversational AI system that followed, from ChatGPT to Claude to Gemini.
Shazeer’s return to Google to work on Gemini represents a remarkable arc. The engineer who co-created the fundamental architecture of modern AI is now working on one of its most advanced instantiations. With the Transformer, the MoE, and deep experience in both research and production systems, Shazeer is uniquely positioned to push the frontier of what AI systems can achieve. His career is a reminder that in technology, the most transformative contributions often come not from those who seek attention but from those who seek solutions — and that a single well-chosen tool or architecture, in the hands of a skilled engineer, can change the world.
Key Facts
- Full name: Noam Shazeer
- Education: B.S. Mathematics, Duke University
- Known for: Co-authoring “Attention Is All You Need” (Transformer architecture), inventing the Sparsely-Gated Mixture of Experts layer, co-founding Character.AI
- Key positions: Google engineer and researcher (2000–2021), Co-founder and CEO of Character.AI (2021–2024), Google DeepMind / Gemini team (2024–present)
- Key publications: “Attention Is All You Need” (2017), “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (2017), “Towards a Human-like Open-Domain Chatbot” (Meena, 2020), “LaMDA: Language Models for Dialog Applications” (2022)
- Character.AI: Co-founded with Daniel De Freitas in 2021; licensed to Google in 2024 in a deal valued at approximately $2.7 billion
- Other contributions: Google ad-serving system architecture, Switch Transformer (scaling MoE to trillions of parameters), Adafactor optimizer
- Impact: The Transformer architecture he co-created is the foundation of GPT, BERT, Claude, Gemini, LLaMA, and virtually every modern large language model
Frequently Asked Questions
What is Noam Shazeer’s role in creating the Transformer architecture?
Noam Shazeer was one of eight co-authors of the 2017 paper “Attention Is All You Need,” which introduced the Transformer architecture. His contributions were particularly central to the design of the attention mechanism itself — the scaled dot-product attention and multi-head attention that form the core of the architecture. Shazeer brought to the project years of experience building large-scale machine learning systems at Google, and his engineering intuition was instrumental in designing an architecture that was not only theoretically elegant but also practical to train at scale on Google’s Tensor Processing Units (TPUs). The Transformer replaced recurrent neural networks as the dominant architecture for sequence modeling and became the foundation of all modern large language models, including GPT-4, Claude, Gemini, and others.
What is the Mixture of Experts approach and why does it matter for modern AI?
The Mixture of Experts (MoE) approach, which Shazeer developed in his 2017 paper on sparsely-gated MoE layers, is a technique for scaling neural networks to very large sizes without proportionally increasing the computational cost of each forward pass. The key idea is to divide the model into many specialized “expert” sub-networks and use a learned gating function to route each input to only a small subset of experts. This means a model can have trillions of parameters — giving it enormous capacity to learn complex patterns — while only activating a fraction of those parameters for any given input. This approach directly influenced the design of Google’s Switch Transformer and is a central architectural choice in the Gemini model family. MoE has become the standard technique for training the largest and most capable language models, because it breaks the otherwise prohibitive linear relationship between model size and computational cost.
Why did Noam Shazeer leave Google, start Character.AI, and then return to Google?
Shazeer left Google in late 2021, reportedly frustrated that the company was not moving quickly enough to deploy the conversational AI technology he had helped create — particularly LaMDA. He co-founded Character.AI with Daniel De Freitas to build a platform where users could interact with AI characters powered by large language models. The startup was enormously successful, attracting tens of millions of users and raising significant venture capital. However, in August 2024, Google signed a licensing deal for Character.AI’s technology and brought Shazeer back to Google DeepMind to work on Gemini. The deal, valued at approximately $2.7 billion, was structured as a licensing agreement rather than an acquisition. Shazeer’s return to Google reunited the co-creator of the Transformer with one of the most ambitious projects built on that architecture, and it reflected Google’s recognition that Shazeer’s unique combination of architectural insight and engineering expertise was essential for competing at the frontier of AI development.