Jurgen Schmidhuber: The Co-Inventor of LSTM and Pioneer of Deep Learning

In 2015, when Google’s voice recognition accuracy crossed 90 percent for the first time, the technology behind it was Long Short-Term Memory — a recurrent neural network architecture conceived two decades earlier by a doctoral student and his advisor in Munich. That advisor was Jurgen Schmidhuber. By the time Silicon Valley caught up with his ideas, Schmidhuber had spent nearly 30 years at the Swiss AI Lab IDSIA, publishing hundreds of papers on topics ranging from recurrent networks and meta-learning to algorithmic information theory and artificial curiosity. His work, along with that of his students and collaborators, helped lay the technical foundations for the deep learning revolution that now powers everything from machine translation to autonomous vehicles. Yet Schmidhuber remains one of the most polarizing figures in AI research — celebrated by some as a visionary who was decades ahead of his field, and criticized by others for what they see as excessive self-promotion and aggressive priority claims. The truth, as with most things in science, is more nuanced than either camp suggests.

Early Life and Education

Jurgen Schmidhuber was born on January 17, 1963, in Munich, West Germany. He grew up during a period when artificial intelligence research was largely dominated by symbolic methods — rule-based expert systems and logical reasoning — while neural network approaches had fallen out of fashion following the criticisms leveled at perceptrons by Marvin Minsky and Seymour Papert in 1969. Despite this, Schmidhuber was drawn early to the idea of building machines that could learn, and specifically to the ambitious goal of building an artificial intelligence smarter than himself — a goal he has described as his life’s driving motivation since he was a teenager.

Schmidhuber studied computer science and mathematics at the Technical University of Munich (TUM), one of Germany’s leading technical universities. He completed his diploma (equivalent to a master’s degree) in 1987 and immediately began doctoral research on learning algorithms for neural networks. His Ph.D. thesis, completed in 1991 under the supervision of Wilfried Brauer, focused on problems that would define his entire career: how neural networks could learn to process sequences, how they could learn to learn (meta-learning), and how they could compress information efficiently. Even in these early years, Schmidhuber was publishing work on recurrent neural networks — a class of networks that maintain internal memory and can process sequences of variable length — at a time when most researchers had moved on to other approaches.

After completing his doctorate, Schmidhuber joined the Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA) in Lugano, Switzerland, where he would remain for decades as co-director and later scientific director. IDSIA, a small but remarkably productive research lab, became the base from which Schmidhuber and his students produced a series of foundational contributions to deep learning, reinforcement learning, and sequence modeling.

The LSTM Breakthrough

Technical Innovation

The central problem that LSTM solved was the vanishing gradient problem in recurrent neural networks (RNNs). Standard RNNs process sequences by maintaining a hidden state that gets updated at each time step. In theory, this allows them to learn long-range dependencies — for example, connecting a word at the beginning of a sentence to a word at the end. In practice, when training these networks using backpropagation through time, the gradients (the signals used to update the network’s weights) either shrink exponentially toward zero or explode toward infinity as they are propagated back through many time steps. This means that standard RNNs effectively cannot learn dependencies that span more than about 10-20 time steps.
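A toy numerical experiment makes the effect concrete (a sketch with arbitrary random weights, not a trained network): backpropagating an error signal through 100 steps of a tanh RNN collapses its norm toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
# A small random recurrent weight matrix whose effective per-step
# gain is below 1, as is typical for an untrained vanilla RNN
W = rng.standard_normal((n, n)) * 0.25 / np.sqrt(n)

grad = np.ones(n)   # error signal arriving at the final time step
norms = []
for _ in range(100):
    h_pre = rng.standard_normal(n)  # stand-in hidden pre-activations
    # One step of backpropagation through time: multiply by W^T
    # and by the derivative of tanh at the pre-activation
    grad = W.T @ (grad * (1.0 - np.tanh(h_pre) ** 2))
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after 10 steps:  {norms[9]:.3e}")
print(f"gradient norm after 100 steps: {norms[99]:.3e}")
```

The signal that should teach the network about step 1 is already tiny after 10 steps and effectively gone after 100, which is exactly the failure mode Hochreiter analyzed.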

Schmidhuber and his doctoral student Sepp Hochreiter attacked this problem directly. Hochreiter had already provided a rigorous analysis of the vanishing gradient problem in his diploma thesis in 1991, supervised by Schmidhuber. Building on this analysis, they developed Long Short-Term Memory, published in the journal Neural Computation in 1997. The key idea was to introduce a memory cell with a linear self-connection — a recurrent connection with a weight of exactly 1.0 — protected by learned gating mechanisms.

The LSTM architecture introduced three gates: an input gate that controls what new information is written to the cell state; a forget gate (added later by Felix Gers and Schmidhuber in 2000) that controls what old information is erased from the cell state; and an output gate that controls what information from the cell state is exposed to the rest of the network. These gates are themselves neural networks that learn when to open and close based on the current input and the previous hidden state.

import numpy as np

class LSTMCell:
    """
    A single LSTM cell illustrating the gating mechanism
    introduced by Hochreiter and Schmidhuber (1997).

    The key insight: a memory cell with a constant error
    carousel (CEC) that preserves gradients over long
    time spans, gated by learned sigmoid functions.
    """
    def __init__(self, input_size, hidden_size):
        # Combined weight matrix for all four gates
        # [input_gate, forget_gate, cell_gate, output_gate]
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size) * 0.01
        self.b = np.zeros(4 * hidden_size)
        self.hidden_size = hidden_size

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -15, 15)))

    def forward(self, x, h_prev, c_prev):
        """
        x:      current input vector
        h_prev: previous hidden state
        c_prev: previous cell state (the constant error carousel)
        """
        # Concatenate input and previous hidden state
        combined = np.concatenate([x, h_prev])

        # Compute all gate activations in one matrix multiply
        gates = self.W @ combined + self.b
        hs = self.hidden_size

        # Input gate: controls what new info enters the cell
        i = self.sigmoid(gates[:hs])

        # Forget gate (Gers & Schmidhuber, 2000):
        # controls what old info is retained
        f = self.sigmoid(gates[hs:2*hs])

        # Cell candidate: new information to potentially store
        g = np.tanh(gates[2*hs:3*hs])

        # Output gate: controls what cell state is exposed
        o = self.sigmoid(gates[3*hs:])

        # The constant error carousel (CEC) — the core of LSTM
        # Gradients flow through c without vanishing because
        # the forget gate can learn to hold f close to 1.0
        c_new = f * c_prev + i * g

        # Hidden state output, filtered by the output gate
        h_new = o * np.tanh(c_new)

        return h_new, c_new


# Demonstration: processing a sequence
cell = LSTMCell(input_size=10, hidden_size=32)
h = np.zeros(32)  # Initial hidden state
c = np.zeros(32)  # Initial cell state — the memory

# Process a sequence of 100 time steps
# LSTM can maintain gradients across all of them
sequence = [np.random.randn(10) for _ in range(100)]
for x_t in sequence:
    h, c = cell.forward(x_t, h, c)

print(f"Final hidden state norm: {np.linalg.norm(h):.4f}")
print(f"Cell state norm: {np.linalg.norm(c):.4f}")
# Unlike vanilla RNNs, these values remain stable
# even after 100+ time steps

The mathematical elegance of LSTM lies in the constant error carousel: because the cell state is updated through addition rather than multiplication (c_new = f * c_prev + i * g), and because the forget gate can learn to hold values close to 1.0, gradients can flow backward through hundreds or even thousands of time steps without vanishing. This was a direct engineering solution to the precise mathematical problem Hochreiter had identified.
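The arithmetic behind that claim can be shown with toy numbers (illustrative constants, not a trained model): the backpropagated cell-state gradient is scaled by a product of per-step factors, and only a factor near 1.0 survives a long product.

```python
# Through T steps, the gradient w.r.t. the initial cell state is
# scaled (roughly) by the product of the per-step factors:
#   dL/dc_0 ~ dL/dc_T * f_T * f_{T-1} * ... * f_1
T = 1000
rnn_factor = 0.5     # effective per-step gain in an ill-conditioned vanilla RNN
forget_gate = 0.999  # a forget gate that has learned to stay near 1.0

print(f"vanilla-RNN-like scaling over {T} steps: {rnn_factor ** T:.3e}")
print(f"LSTM forget-gate scaling over {T} steps: {forget_gate ** T:.3e}")
# roughly 1e-301 versus roughly 0.37: the LSTM's error signal survives
```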

Why It Mattered

LSTM did not achieve mainstream adoption immediately. When it was published in 1997, the AI field was still in the grip of the second AI winter, and most researchers were focused on support vector machines, kernel methods, and other non-neural approaches. It took more than a decade for hardware (GPUs) and data availability to catch up with the architecture’s potential.

When the deep learning revolution began in earnest around 2012-2014, LSTM became the dominant architecture for virtually every sequence modeling task. Google adopted LSTM for its speech recognition system in 2015, reducing word error rates by nearly 50 percent. Apple used LSTM for Siri’s language processing. Amazon deployed it for Alexa. LSTM networks powered Google Translate’s neural machine translation system, which launched in 2016 and dramatically improved translation quality. Facebook, Microsoft, Baidu, and dozens of other companies integrated LSTM into their products for tasks including speech recognition, machine translation, text generation, handwriting recognition, music composition, and video analysis.

By 2017, LSTM and its variants (particularly the Gated Recurrent Unit, or GRU, introduced by Cho et al. in 2014) were processing billions of queries per day across the world’s major technology platforms. The architecture that Hochreiter and Schmidhuber had published in 1997, in a paper that initially attracted little attention, had become one of the most commercially important innovations in the history of computer science.
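For comparison, a minimal GRU step with arbitrary random weights (a sketch following Cho et al.’s formulation, not production code) shows how the GRU collapses LSTM’s three gates and separate cell state into two gates and a single hidden vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step (after Cho et al., 2014): an update gate z and
    a reset gate r replace LSTM's three gates and separate cell."""
    combined = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ combined)                      # update gate
    r = sigmoid(Wr @ combined)                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    return (1.0 - z) * h_prev + z * h_tilde         # gated interpolation

rng = np.random.default_rng(0)
n_in, n_hidden = 10, 32
Wz, Wr, Wh = (rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.01
              for _ in range(3))

h = np.zeros(n_hidden)
for _ in range(100):
    h = gru_step(rng.standard_normal(n_in), h, Wz, Wr, Wh)
print(f"hidden state norm after 100 steps: {np.linalg.norm(h):.4f}")
```

The update gate plays the combined role of LSTM’s input and forget gates, interpolating between keeping the old state and writing the candidate.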

The Transformer architecture, introduced by Ashish Vaswani and colleagues at Google in 2017, has since replaced LSTM as the dominant sequence model for many tasks, particularly in natural language processing. However, LSTM remains widely used in speech processing, time series forecasting, and other domains, and the gating mechanisms it pioneered influenced the design of the Transformer’s attention mechanism. The intellectual lineage from LSTM to the Transformer — and thus to GPT, BERT, and every modern large language model — is direct and well-documented.

Other Major Contributions

Schmidhuber’s research output extends well beyond LSTM. His lab at IDSIA produced a remarkable number of foundational contributions to deep learning during the 2000s and 2010s, many developed by his doctoral students who went on to become prominent researchers in their own right.

Connectionist Temporal Classification (CTC). In 2006, Schmidhuber’s student Alex Graves developed CTC, a training method for sequence labeling tasks where the alignment between inputs and outputs is unknown. CTC allowed LSTM networks to be trained end-to-end for speech recognition and handwriting recognition without requiring pre-segmented training data. This was the key technical innovation that made LSTM practical for industrial speech recognition systems. When Google deployed LSTM-based speech recognition in 2015, it used CTC as the training objective. CTC remains widely used in speech and handwriting recognition systems today.
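The output side of CTC can be sketched in a few lines (a toy illustration; training itself sums probabilities over all alignments with dynamic programming): merge repeated symbols, then delete the blank token.

```python
def ctc_collapse(path, blank="-"):
    """Map a frame-level path to a label sequence, as in CTC
    (Graves et al., 2006): merge repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many frame-level alignments map to the same transcription, which
# is why no pre-segmented training data is needed:
print(ctc_collapse("hh-e-ll-ll-oo"))  # -> "hello"
print(ctc_collapse("-h-e-l-l-o--"))   # -> "hello"
```

Note that the blank also serves to separate genuine repeats: "ll-ll" collapses to "ll", while "llll" would collapse to a single "l".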

Highway networks. In 2015, Schmidhuber’s students Rupesh Srivastava and Klaus Greff published highway networks — deep feedforward networks with gated shortcut connections that allow information to flow across many layers without degradation. Highway networks were a direct application of LSTM’s gating principle to feedforward networks and were a precursor to residual networks (ResNets), published by Kaiming He and colleagues at Microsoft later the same year. ResNets used a simplified version of the highway network’s skip connections and became one of the most influential architectures in deep learning. The relationship between highway networks and ResNets is a matter of ongoing discussion in the research community, with Schmidhuber arguing that ResNets are a special case of highway networks.
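The layer itself is a one-line gating rule, y = T(x) * H(x) + (1 - T(x)) * x. The sketch below uses untrained random weights and a negative transform-gate bias (following the paper’s initialization advice) to show the carry behavior that keeps very deep stacks trainable, contrasted with a plain stack that loses the signal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HighwayLayer:
    """y = T(x) * H(x) + (1 - T(x)) * x  (Srivastava, Greff &
    Schmidhuber, 2015): LSTM-style gating applied across depth."""
    def __init__(self, size, rng):
        self.W_h = rng.standard_normal((size, size)) * 0.05
        self.W_t = rng.standard_normal((size, size)) * 0.05
        # Negative transform-gate bias: layers start out mostly
        # carrying their input through unchanged
        self.b_t = np.full(size, -4.0)

    def forward(self, x):
        h = np.tanh(self.W_h @ x)               # candidate transform H(x)
        t = sigmoid(self.W_t @ x + self.b_t)    # transform gate T(x)
        return t * h + (1.0 - t) * x            # gated skip connection

rng = np.random.default_rng(1)
x = rng.standard_normal(64)

y = x
for layer in [HighwayLayer(64, rng) for _ in range(50)]:
    y = layer.forward(y)

# For contrast, 50 plain tanh layers at the same weight scale
# shrink the signal toward zero
y_plain = x
for _ in range(50):
    y_plain = np.tanh((rng.standard_normal((64, 64)) * 0.05) @ y_plain)

print(f"||x||         = {np.linalg.norm(x):.3f}")
print(f"||y_highway|| = {np.linalg.norm(y):.3f}")
print(f"||y_plain||   = {np.linalg.norm(y_plain):.2e}")
```

ResNets simplify this by fixing the carry path to an ungated identity, y = H(x) + x, which is one reason the two families are often discussed together.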

Meta-learning and learning to learn. Schmidhuber has been a pioneer of meta-learning — the idea of training neural networks to improve their own learning algorithms. His early work in the 1990s on self-referential learning systems, which use neural networks to modify their own weights, anticipated the modern field of meta-learning by two decades. Current approaches to few-shot learning, neural architecture search, and learned optimizers can trace their intellectual roots to Schmidhuber’s early explorations.

Compression-based AI theory. Perhaps Schmidhuber’s most distinctive theoretical contribution is his compression-based framework for understanding intelligence, creativity, and curiosity. Drawing on Kolmogorov complexity and algorithmic information theory, Schmidhuber has argued that intelligence can be understood as data compression: a good model of the world is one that compresses observations efficiently. He has extended this idea to define formal notions of beauty (patterns that offer surprising compression gains) and curiosity (the drive to seek out data that improves the agent’s world model). His formal theory of fun and creativity, published in papers from 2006 to 2010, proposes that the subjective experience of interestingness is proportional to the rate of compression progress — the agent finds things interesting precisely when they allow the agent to compress its experience more efficiently.
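A toy rendering of the idea, with zlib’s compressed length standing in for the agent’s learned world model (my substitution for illustration, not Schmidhuber’s actual formalism): re-encountering a learnable pattern yields compression progress, while fresh noise never gets cheaper.

```python
import os
import zlib

def cond_cost(history: bytes, obs: bytes) -> int:
    """Crude proxy for the description length of obs given history,
    using zlib as a stand-in for a learned world model."""
    return len(zlib.compress(history + obs)) - len(zlib.compress(history))

pattern = b"the quick brown fox jumps over the lazy dog. " * 40

# A learnable pattern is expensive at first, cheap once modeled;
# the drop is the "compression progress" that rewards curiosity
print("pattern, unseen:", cond_cost(b"", pattern))
print("pattern, known: ", cond_cost(pattern, pattern))

# True noise offers no progress: past noise never helps with new noise
n1, n2 = os.urandom(256), os.urandom(256)
print("noise, unseen:     ", cond_cost(b"", n1))
print("noise, after noise:", cond_cost(n1, n2))
```

In Schmidhuber’s terms, the pattern is interesting (it offers learnable structure), while the noise is not, even though both are initially surprising.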

Recurrent neural network variants. Beyond LSTM, Schmidhuber and his lab contributed numerous improvements to recurrent network training and architecture. These include the use of LSTM for reinforcement learning (which anticipated modern approaches to training agents with recurrent memory), deep LSTM networks with multiple stacked layers, bidirectional LSTM networks, and multi-dimensional LSTM for image and video processing. Alex Graves, who completed his Ph.D. under Schmidhuber, went on to develop neural Turing machines and early differentiable attention mechanisms (2013-2014) — both of which influenced the attention-based Transformer architecture that now dominates AI.

Philosophy and Approach

Key Principles

Schmidhuber’s research philosophy is characterized by several distinctive features that set him apart from many of his contemporaries in the AI field.

Theoretical rigor and formal frameworks. Unlike many deep learning researchers who take a primarily empirical approach, Schmidhuber has consistently sought formal, mathematical foundations for his work. His compression-based theory of intelligence, his formal definition of creativity, and his analysis of the vanishing gradient problem all reflect a commitment to rigorous theoretical frameworks. This orientation toward theory has sometimes put him at odds with the more experimentally driven culture of modern machine learning research.

Historical credit attribution. Schmidhuber is known for his vigorous efforts to ensure that historical credit is properly assigned in the deep learning literature. He has publicly argued that many ideas attributed to other researchers — including certain aspects of backpropagation, generative adversarial networks, and the Transformer architecture — have precursors in earlier work by himself or others that has been insufficiently acknowledged. These priority claims have generated significant controversy within the AI community. Some researchers view his efforts as a valuable corrective to a field that often overlooks its own history; others see them as disproportionate and disruptive. The debates highlight a genuine tension in rapidly moving fields between recognizing prior work and acknowledging the novel contributions that make ideas practically viable.

Long-term vision. Since childhood, Schmidhuber has articulated an unusually ambitious goal: to build an AI that surpasses human intelligence and then, in his words, retire and watch it explore the universe. This long-term vision has informed his research agenda, leading him to work on problems — meta-learning, self-improving systems, universal AI — that were considered fringe when he began studying them but are now central to the field. His company NNAISENSE (pronounced “nascence”), founded in 2014, focuses on building general-purpose AI systems for industrial applications, reflecting his continued pursuit of this goal.

Breadth over specialization. Schmidhuber’s publication record spans an unusually wide range of topics for a single researcher: sequence modeling, reinforcement learning, evolutionary computation, algorithmic information theory, computational aesthetics, meta-learning, and neural network compression. This breadth is both a strength — allowing him to see connections across subfields — and a source of criticism from specialists who argue that some of his theoretical frameworks lack the depth or empirical validation of more focused research programs.

Legacy and Impact

Schmidhuber’s legacy in deep learning is substantial and multifaceted. LSTM alone would secure his place in the history of artificial intelligence — it is one of the most widely deployed neural network architectures ever created and remained the dominant sequence model for nearly two decades. The techniques developed in his lab, including CTC and highway networks, solved practical problems that enabled the deployment of deep learning at industrial scale.

His influence extends through his students. Alex Graves contributed early attention mechanisms and neural Turing machines to the field. Sepp Hochreiter continued foundational work on learning algorithms at Johannes Kepler University Linz. Other IDSIA alumni have taken positions at major research labs and companies around the world, carrying forward the technical agenda that Schmidhuber helped define.

The relationship between Schmidhuber’s work and the current generation of AI systems is direct. Modern large language models like those developed by OpenAI (GPT series), Google (Gemini), and Anthropic (Claude) are Transformer-based architectures whose attention mechanism has documented connections to the earlier work on attention in LSTM networks by Graves (Schmidhuber’s student). The gating mechanisms in LSTM influenced the design of modern architectures. The meta-learning approaches Schmidhuber pioneered in the 1990s are now standard tools in the field.

Schmidhuber’s impact on the broader intellectual discourse around AI is also significant. His compression-based theory of intelligence offers a coherent mathematical framework for thinking about learning, creativity, and curiosity — concepts that are often treated informally in AI research. Whether or not one agrees with all aspects of his theoretical program, it represents one of the most sustained attempts to provide rigorous mathematical foundations for artificial general intelligence.

His advocacy for proper credit attribution, however contentious, has also had a positive effect on the field by encouraging researchers and journalists to look more carefully at the historical record. The history of deep learning is genuinely complex, with many ideas being independently discovered multiple times or languishing in obscurity for decades before being rediscovered. Schmidhuber’s insistence on acknowledging this history has contributed to a more accurate understanding of how the field evolved, even when the manner of that insistence has been controversial.

At 63, Schmidhuber continues to publish actively, lead NNAISENSE, and advocate for his vision of artificial general intelligence. The field he helped build — deep learning — has become one of the most transformative technologies of the 21st century, with applications in healthcare, autonomous systems, scientific discovery, and creative tools. The gating mechanism he and Hochreiter introduced in 1997 remains one of the most important architectural innovations in the history of neural networks, and its influence will be felt for decades to come. The story of LSTM — an idea that was two decades ahead of its time, published during an AI winter, and eventually adopted by every major technology company in the world — is one of the most remarkable examples of scientific patience and vindication in modern computer science.

Key Facts

  • Born: January 17, 1963, Munich, West Germany
  • Known for: Co-inventing LSTM (with Sepp Hochreiter), deep learning research, meta-learning, compression-based AI theory, highway networks
  • Institutions: IDSIA (Lugano, Switzerland), Technical University of Munich, King Abdullah University of Science and Technology (KAUST)
  • Key publications: “Long Short-Term Memory” (Neural Computation, 1997), “Learning to Forget” (2000), highway networks (2015), formal theory of fun and creativity (2006-2010)
  • Company: NNAISENSE (founded 2014), focused on general-purpose AI for industrial applications
  • Notable students: Sepp Hochreiter, Alex Graves, Shane Legg (co-founder of DeepMind), Daan Wierstra
  • Awards: IEEE Neural Network Pioneer Award (2016), AAAI Fellow, Helmholtz Award from the International Neural Network Society
  • Influence: LSTM used by Google, Apple, Amazon, Microsoft, Facebook, and virtually every major technology company for speech recognition, translation, and language processing

Frequently Asked Questions

What is LSTM and why was it important?

Long Short-Term Memory (LSTM) is a type of recurrent neural network architecture designed to learn long-range dependencies in sequential data. Before LSTM, standard recurrent networks could not effectively learn patterns that spanned more than about 10-20 time steps due to the vanishing gradient problem. LSTM solved this by introducing a memory cell with gating mechanisms that control the flow of information. This architecture became the foundation for breakthroughs in speech recognition, machine translation, and natural language processing, and remained the dominant sequence model from the mid-2000s until the introduction of the Transformer architecture in 2017. Even today, LSTM continues to be used in many production systems for time series analysis and audio processing.

What is the relationship between LSTM and modern AI models like GPT?

Modern large language models like GPT and Claude are based on the Transformer architecture, not LSTM. However, the connection between the two is significant. The Transformer’s attention mechanism was influenced by earlier attention mechanisms developed for LSTM networks by Alex Graves, who was Schmidhuber’s doctoral student at IDSIA. The gating concepts introduced in LSTM also influenced later architectural innovations. More broadly, LSTM demonstrated that neural networks could effectively process long sequences and learn complex temporal patterns — a capability that the Transformer architecture later achieved through a different mechanism (self-attention). The intellectual lineage from LSTM to Transformers involves contributions from researchers across multiple labs, including Schmidhuber’s group, Geoffrey Hinton’s lab, Yoshua Bengio’s group, and others.

Why is Schmidhuber considered controversial in the AI community?

Schmidhuber is controversial primarily because of his vigorous and persistent claims about credit attribution in the deep learning field. He has argued publicly that many widely celebrated ideas — including certain aspects of deep learning, generative adversarial networks, and the Transformer — have precursors in his own earlier work or in the work of other researchers who have been insufficiently credited. Some colleagues view these claims as legitimate corrections to a field that often fails to acknowledge its own history. Others find the claims overstated or the manner of advocacy disproportionate. Schmidhuber has also described himself as a contender for the title of “father of modern AI,” which has drawn both support and criticism. The controversy reflects broader tensions in the AI community about how credit should be assigned in a field where ideas are often developed incrementally and independently by multiple researchers.

What is Schmidhuber’s compression theory of intelligence?

Schmidhuber has developed a theoretical framework based on algorithmic information theory that defines intelligence, creativity, and curiosity in terms of data compression. The core idea is that a good model of the world is one that compresses observations efficiently — finding patterns, regularities, and structure that allow a shorter description of the data. Schmidhuber extends this to define beauty as the subjective experience of discovering a pattern that offers surprising compression gains, and curiosity as the drive to seek out experiences that improve the agent’s ability to compress future data. His “formal theory of fun and creativity” proposes that the reward signal for an intrinsically motivated agent should be proportional to the rate at which its world model improves — essentially, the agent finds things interesting precisely when they help it learn. This framework, while not universally accepted, represents one of the most mathematically rigorous attempts to formalize concepts like creativity and curiosity within AI research, and it has influenced subsequent work on intrinsic motivation and exploration in reinforcement learning.