Andy Pavlo: The Database Researcher Shaping the Future of Data Systems

In an era when most database researchers content themselves with incremental optimizations, Andy Pavlo dared to ask a radical question: what if databases could tune and manage themselves entirely without human intervention? As an Associate Professor of Databaseology at Carnegie Mellon University and co-founder of OtterTune, Pavlo has spent over a decade pushing the boundaries of what database management systems can do — from pioneering self-driving database architectures to building open educational resources that have taught thousands of engineers worldwide. His work sits at the intersection of systems engineering, machine learning, and practical database administration, and it has fundamentally altered how the industry thinks about the future of data infrastructure.

Early Life and Education

Andy Pavlo grew up in Baltimore, Maryland, where he developed an early interest in computing. He pursued his undergraduate education at the Rochester Institute of Technology, earning a Bachelor of Science in Computer Science in 2005, followed by a Master of Science degree in 2006. During this formative period, Pavlo gained practical systems experience working as a programmer on the HTCondor project at the University of Wisconsin–Madison Computer Sciences Department under Miron Livny. HTCondor is a high-throughput computing framework used for managing large-scale distributed workloads, and this early exposure to systems-level thinking would prove foundational for his later research.

Pavlo then enrolled at Brown University for his doctoral studies, where he worked under the supervision of two towering figures in database research: Stanley Zdonik and Michael Stonebraker, the Turing Award–winning creator of Ingres and Postgres, the research system from which PostgreSQL descends. Under their mentorship, Pavlo focused his dissertation on the optimization of partitioned database management systems, specifically the H-Store project. H-Store was a revolutionary in-memory, distributed transaction processing system designed to eliminate the overhead of traditional disk-based architectures. The system demonstrated that by removing legacy assumptions about storage and concurrency, transaction throughput could be improved by orders of magnitude.

His dissertation, titled “On Scalable Transaction Execution in Partitioned Main Memory Database Management Systems,” earned him the prestigious ACM SIGMOD Jim Gray Doctoral Dissertation Award in 2014 — one of the highest honors a young database researcher can receive. H-Store was later commercialized as VoltDB, validating the practical applicability of Pavlo’s academic work.

Career and Technical Contributions

After completing his PhD, Pavlo joined Carnegie Mellon University in 2013 as an assistant professor in the Computer Science Department. He quickly established the CMU Database Group as one of the most productive and influential database research labs in the world. He now goes by an unusual, self-styled title, Associate Professor of Databaseology, which reflects both his deep commitment to the field and his willingness to bring personality and humor to academic culture.

Technical Innovation

Pavlo’s most significant technical contribution is the concept of the self-driving database management system. This research program, which has spanned multiple projects and nearly a decade of work, aims to create database systems that can automatically optimize their own configuration, physical design, and runtime behavior without any human database administrator involvement.

The first major project in this line of research was Peloton, a relational DBMS designed from scratch for autonomous operation. Peloton explored how machine learning models could be embedded directly into the database engine to predict workload patterns and automatically adjust indexes, materialized views, and knob configurations. While Peloton served as an important proof of concept, Pavlo and his team recognized limitations in their initial approach and made the bold decision to start over with a cleaner architecture.

The result was NoisePage, a next-generation self-driving DBMS built with three core ML-driven components: workload forecasting, behavior modeling, and action planning. NoisePage features a PostgreSQL-compatible wire protocol and catalogs, lock-free multi-version concurrency control (MVCC), and just-in-time query compilation using LLVM. The architecture enables the system to predict future query patterns, model how configuration changes affect performance, and autonomously plan and execute optimization actions.

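The following simplified Python sketch illustrates how a forecasting component of this kind might model an incoming workload pattern to predict resource demands. It is not NoisePage's actual code, and the model choice, telemetry layout, and feature names are illustrative: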

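from datetime import datetime

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class WorkloadForecaster:
    """Predicts future query arrival rates and resource demands
    using historical workload telemetry from the DBMS."""

    def __init__(self, horizon=60, features=None):
        self.horizon = horizon  # forecast window in seconds
        self.model = GradientBoostingRegressor(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.05
        )
        self.feature_names = features or [
            'query_rate_1m', 'query_rate_5m', 'query_rate_15m',
            'write_ratio', 'index_hit_ratio', 'buffer_pool_usage',
            'hour_of_day', 'day_of_week', 'is_peak_window'
        ]

    def extract_features(self, telemetry, now=None):
        """Extract time-series features from raw DBMS telemetry.

        Assumes `telemetry` is a dict holding per-second 'queries' and
        'writes' counts plus the latest 'index_hit_ratio' and
        'buffer_pool_usage' gauges (an illustrative layout).
        """
        now = now or datetime.now()
        queries = np.asarray(telemetry['queries'], dtype=float)
        writes = np.asarray(telemetry['writes'], dtype=float)
        total = queries.sum()
        features = {
            'query_rate_1m': queries[-60:].mean(),
            'query_rate_5m': queries[-300:].mean(),
            'query_rate_15m': queries[-900:].mean(),
            'write_ratio': writes.sum() / total if total else 0.0,
            'index_hit_ratio': telemetry['index_hit_ratio'],
            'buffer_pool_usage': telemetry['buffer_pool_usage'],
            'hour_of_day': now.hour,
            'day_of_week': now.weekday(),
            # Assumed peak window: 09:00 to 17:00 local time.
            'is_peak_window': float(9 <= now.hour < 17),
        }
        return np.array([features[f] for f in self.feature_names], dtype=float)

    def fit(self, feature_rows, observed_rates):
        """Train on past feature vectors and the queries/sec that followed."""
        self.model.fit(feature_rows, observed_rates)
        return self

    def predict_demand(self, current_telemetry):
        """Return predicted queries/sec for the next horizon window.
        Call fit() first; an unfitted model raises NotFittedError."""
        X = self.extract_features(current_telemetry).reshape(1, -1)
        return self.model.predict(X)[0]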
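
Forecasting is only the first of the three components described above. As a rough sketch of how the other two might fit together, the snippet below pairs a toy behavior model with a greedy action planner. Everything in it is an assumption made for illustration: the Action class, the latency formula in predicted_latency_ms, the cost_weight parameter, and the example numbers are hypothetical and do not come from NoisePage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A candidate tuning action (hypothetical names and numbers)."""
    name: str
    capacity_gain_qps: float  # extra queries/sec the system could serve
    apply_cost_s: float       # seconds of disruption while applying it


def predicted_latency_ms(demand_qps, capacity_qps, base_ms=2.0):
    """Toy behavior model: latency grows as utilization approaches 1."""
    utilization = min(demand_qps / capacity_qps, 0.99)
    return base_ms / (1.0 - utilization)


def plan_action(forecast_qps, capacity_qps, actions, cost_weight=0.01):
    """Pick the action with the lowest predicted latency plus weighted
    apply cost. Returns None when doing nothing scores best."""
    best = None
    best_score = predicted_latency_ms(forecast_qps, capacity_qps)
    for action in actions:
        latency = predicted_latency_ms(
            forecast_qps, capacity_qps + action.capacity_gain_qps
        )
        score = latency + cost_weight * action.apply_cost_s
        if score < best_score:
            best, best_score = action, score
    return best


candidates = [
    Action("add_index", capacity_gain_qps=500.0, apply_cost_s=30.0),
    Action("grow_buffer_pool", capacity_gain_qps=200.0, apply_cost_s=5.0),
]
# Forecast of 900 queries/sec against a current capacity of 1000.
choice = plan_action(900.0, 1000.0, candidates)
print(choice.name if choice else "no action needed")
```

The planner here is deliberately greedy and looks only one step ahead. A real system would have to weigh sequences of actions and the uncertainty of the forecast itself.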

In parallel with the self-driving database research, Pavlo co-founded OtterTune in 2020, a startup that took a complementary “black-box” approach to the same problem. Rather than building a new DBMS from scratch, OtterTune used machine learning to automatically tune the configuration knobs of existing database systems such as PostgreSQL and MySQL, including instances running on Amazon RDS. The system would observe how a database responded to various configurations, build performance models, and iteratively adjust parameters to minimize latency and reduce costs.
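
The observe, model, and adjust cycle described above can be sketched as a small loop. This is a minimal illustration rather than OtterTune's algorithm: simple hill climbing stands in for the learned performance models, simulated_latency_ms is a made-up stand-in for benchmarking a real database, and the knob keys shared_buffers_mb and work_mem_mb are hypothetical names loosely modeled on PostgreSQL settings.

```python
def simulated_latency_ms(config):
    """Stand-in for running a workload against a real DBMS (hypothetical).

    Latency is lowest at shared_buffers_mb=4096 and work_mem_mb=64.
    """
    buffers_penalty = ((config["shared_buffers_mb"] - 4096) / 1024) ** 2
    work_mem_penalty = ((config["work_mem_mb"] - 64) / 16) ** 2
    return 5.0 + buffers_penalty + work_mem_penalty


def tune(measure, start, steps, max_rounds=50):
    """Hill-climb over knob settings, keeping a change only when the
    measured latency improves. Returns (best_config, latency, history)."""
    best = dict(start)
    best_latency = measure(best)
    history = [(dict(best), best_latency)]  # every observation made
    for _ in range(max_rounds):
        improved = False
        for knob, step in steps.items():
            for delta in (step, -step):
                candidate = dict(best)
                candidate[knob] = best[knob] + delta
                latency = measure(candidate)
                history.append((candidate, latency))
                if latency < best_latency:
                    best, best_latency = candidate, latency
                    improved = True
        if not improved:
            break  # no neighboring configuration is better
    return best, best_latency, history


best_config, best_latency, observations = tune(
    simulated_latency_ms,
    start={"shared_buffers_mb": 1024, "work_mem_mb": 16},
    steps={"shared_buffers_mb": 512, "work_mem_mb": 16},
)
print(best_config, best_latency, len(observations))
```

Hill climbing needs a fresh measurement for every candidate, which is expensive against a production database. A model-based tuner avoids much of that cost by predicting which configurations are worth trying before measuring them.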

OtterTune raised $12 million in Series A funding from Intel Capital, Race Capital, and Accel, and grew revenue to $2.2 million by 2023. However, the company shut down in June 2024 after an acquisition deal fell through — a painful but instructive experience that Pavlo has discussed openly, embodying the transparency that characterizes his approach to both research and entrepreneurship.

Why It Mattered

The significance of Pavlo’s self-driving database research extends far beyond academic novelty. Database administration has long been one of the most labor-intensive and expertise-dependent roles in enterprise computing. Organizations spend millions on skilled DBAs to tune configurations, design indexes, and troubleshoot performance issues. As data volumes grow and cloud-native architectures become the norm, the demand for automated database management has become critical.

Pavlo’s work demonstrated that ML-driven automation could handle tasks that previously required years of specialized experience. This idea has influenced major cloud providers, with Amazon, Google, and Microsoft all incorporating automated tuning capabilities into their managed database offerings. The research also inspired a broader movement in the database community toward what Pavlo calls “learned” database components — replacing hand-crafted heuristics with models trained on actual workload data. Engineers who work on modern data infrastructure projects, whether using tools tracked on platforms like Taskee or managing complex deployment pipelines, increasingly rely on the kind of ML-driven optimizations that Pavlo pioneered.

Other Notable Contributions

Beyond self-driving databases, Pavlo has made substantial contributions across multiple areas of database research and practice.

OLTP-Bench and BenchBase: Pavlo co-authored OLTP-Bench, a widely used, extensible benchmarking framework for relational databases. The tool bundles fifteen diverse workloads, from TPC-C to Wikipedia-derived traces, and connects to many relational database systems through JDBC. OLTP-Bench became a standard benchmarking tool in both academic research and industry evaluation. It has since been modernized and is maintained as BenchBase, which continues to serve as a go-to framework for rigorous database performance testing.

Database of Databases (DBDB): Pavlo created and maintains dbdb.io, a comprehensive online encyclopedia cataloging over 1,000 database management systems. Each entry includes technical details about the system’s architecture, data model, query language, and storage engine. This resource has become an invaluable reference for researchers, students, and practitioners navigating the increasingly fragmented database landscape — much like how Richard Hipp’s SQLite documentation serves as a model of technical clarity.

Database Education: Pavlo’s CMU courses, 15-445 (Introduction to Database Systems) and 15-721 (Advanced Database Systems), are among the most popular database courses globally. Lecture recordings are freely available on YouTube, and the accompanying notes and slides are published openly online. These courses have been viewed millions of times and have become a de facto resource for engineers wanting to understand database internals. His teaching combines rigorous academic content with engaging delivery, making complex topics like MVCC, query optimization, and storage engines accessible to a broad audience.

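A typical BenchBase configuration, shown here for the TPC-C benchmark on PostgreSQL, illustrates how researchers can control benchmarking workloads precisely. Connection details are placeholders, and element names can differ between BenchBase versions: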

<!-- BenchBase configuration for TPC-C benchmark on PostgreSQL -->
<parameters>
  <type>POSTGRES</type>
  <driver>org.postgresql.Driver</driver>
  <url>jdbc:postgresql://localhost:5432/benchbase</url>
  <username>benchuser</username>
  <isolation>TRANSACTION_SERIALIZABLE</isolation>

  <scalefactor>10</scalefactor> <!-- number of TPC-C warehouses -->
  <terminals>64</terminals>

  <works>
    <work>
      <time>300</time> <!-- 5-minute run -->
      <rate>unlimited</rate>
      <!-- TPC-C standard transaction mix, in the order listed below:
           NewOrder, Payment, OrderStatus, Delivery, StockLevel -->
      <weights>45,43,4,4,4</weights>
    </work>
  </works>

  <transactiontypes>
    <transactiontype>
      <name>NewOrder</name>
    </transactiontype>
    <transactiontype>
      <name>Payment</name>
    </transactiontype>
    <transactiontype>
      <name>OrderStatus</name>
    </transactiontype>
    <transactiontype>
      <name>Delivery</name>
    </transactiontype>
    <transactiontype>
      <name>StockLevel</name>
    </transactiontype>
  </transactiontypes>
</parameters>
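
To make the weights element concrete, the following Python sketch samples a transaction mix in those proportions. It only illustrates the idea of a weighted mix and is not how BenchBase (a Java tool) is implemented. The sample_mix function and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter

# Same order and percentages as the <weights> element above.
TRANSACTION_TYPES = ["NewOrder", "Payment", "OrderStatus", "Delivery", "StockLevel"]
WEIGHTS = [45, 43, 4, 4, 4]  # percentages; should sum to 100


def sample_mix(n, seed=0):
    """Draw n transaction types at random according to WEIGHTS."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    return rng.choices(TRANSACTION_TYPES, weights=WEIGHTS, k=n)


counts = Counter(sample_mix(10_000))
for name in TRANSACTION_TYPES:
    print(f"{name:12s} {counts[name] / 10_000:.1%}")
```

With enough samples the observed shares settle close to the configured percentages, which is why long runs give a more faithful picture of the intended mix than short ones.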

Annual Database Retrospectives: Since 2019, Pavlo has published detailed annual reviews of the database industry, covering new system releases, funding rounds, acquisitions, and technical trends. These blog posts have become required reading for anyone tracking the evolution of the database market and are frequently cited in industry analyses.

Philosophy and Key Principles

Pavlo’s approach to database research and education is guided by several distinctive principles that set him apart in the academic world.

Radical Openness: Pavlo publishes all course materials, lecture recordings, and much of his research code as open source. He believes that database knowledge should be universally accessible, not locked behind institutional walls. This philosophy extends to his Database of Databases project and his willingness to openly discuss failures, including the shutdown of OtterTune.

Systems Over Theory: While deeply grounded in theoretical foundations, Pavlo consistently emphasizes building real, working systems. His research group produces functional prototypes, not just papers. This systems-first mentality traces back to his training under Stonebraker, who famously advocated for the importance of system building in database research.

Honest Assessment: Pavlo is known for his candid evaluations of database technology, including high-profile critiques of blockchain-based databases and overhyped distributed systems. His annual retrospectives provide unvarnished assessments of industry trends, earning him respect for intellectual honesty even when his views are controversial.

Bridging Academia and Industry: Few researchers move as fluidly between academic research and practical application as Pavlo. His OtterTune startup directly commercialized his lab’s research, and his courses are designed to produce engineers who understand both the theory and practice of building database systems. Teams across the tech industry — including those coordinating complex technical projects through platforms like Toimi — benefit from the kind of practical, systems-oriented thinking that Pavlo champions in his educational work.

Intellectual Humility: Pavlo’s decision to abandon Peloton and start NoisePage from scratch demonstrated a willingness to admit when an approach is not working. In academia, where sunk-cost fallacies can persist for decades, this kind of pragmatic course correction is rare and valuable.

Legacy and Impact

Andy Pavlo’s impact on the database field operates at multiple levels. His research on self-driving databases has established an entirely new subfield, inspiring dozens of research groups worldwide to explore ML-driven database optimization. The concept has moved from academic novelty to industry reality, with major cloud providers now offering automated tuning as a standard feature of their managed database services.

His educational contributions are arguably even more far-reaching. The freely available CMU database courses have democratized access to graduate-level database education, training a generation of engineers who now build and maintain critical data infrastructure at companies worldwide. In this sense, Pavlo’s role parallels that of figures like Andrew Ng in AI education — making elite-level technical knowledge available to anyone with an internet connection.

The tools Pavlo has built, including BenchBase, DBDB, and the published OtterTune research, have become important infrastructure for the database research community. BenchBase is widely used in database performance studies, and DBDB serves as a leading catalog of the world’s database systems.

Pavlo’s influence on the next generation of database researchers is perhaps his most lasting contribution. His students and collaborators have gone on to positions at major technology companies and research institutions, carrying forward the systems-building tradition that he inherited from Stonebraker and that traces back through the lineage of Jim Gray and the Berkeley database group. His work on transaction processing connects him to a tradition that includes pioneers like Larry Ellison, who commercialized relational databases, and Michael “Monty” Widenius, who democratized them through open source.

At a time when data systems are becoming increasingly complex and the demand for skilled database engineers far outstrips supply, Pavlo’s dual mission — advancing the state of the art in automated database management while simultaneously educating the next generation of practitioners — positions him as one of the most consequential database researchers of his generation.

Key Facts

Detail	Information
Full Name	Andrew Pavlo
Born	Baltimore, Maryland, USA
Education	B.S. & M.S., Rochester Institute of Technology; Ph.D., Brown University
Current Position	Associate Professor of Databaseology, Carnegie Mellon University
PhD Advisors	Stanley Zdonik and Michael Stonebraker
Key Projects	H-Store, Peloton, NoisePage, OtterTune, BenchBase, DBDB
Notable Awards	ACM SIGMOD Jim Gray Dissertation Award (2014), Sloan Fellowship (2018), NSF CAREER Award (2019), VLDB Early Career Research Award (2021)
Startup	OtterTune (2020–2024), raised $12M Series A
Known For	Self-driving databases, database benchmarking, open database education

Frequently Asked Questions

What is a self-driving database, and why did Andy Pavlo pioneer this concept?

A self-driving database is a database management system that uses machine learning to automatically manage its own configuration, physical design (indexes, materialized views), and runtime optimization without human intervention. Pavlo pioneered this concept because traditional database administration requires extensive expertise and manual effort: skilled DBAs must constantly monitor performance, tune configuration knobs (sometimes hundreds of them), and adjust physical designs as workloads change. Pavlo’s research through the Peloton and NoisePage projects explored how ML models trained on workload telemetry data could make these decisions autonomously, in some cases matching or outperforming expert human administrators. The concept has since influenced how major cloud database providers approach automated management.

How did OtterTune work, and what happened to the company?

OtterTune was a cloud service that used machine learning to automatically tune the configuration parameters of existing database systems such as PostgreSQL and MySQL, including instances running on Amazon RDS. The system worked by observing a database’s performance metrics under various configurations, building statistical models of how different knob settings affected performance, and iteratively recommending improved configurations. Unlike the self-driving approach of NoisePage (which builds a new DBMS from scratch), OtterTune took a “black-box” approach that worked with existing database systems without modifying them. The company raised $12 million in Series A funding and grew to $2.2 million in revenue, but shut down in June 2024 after a planned acquisition by a PostgreSQL company fell through.

Why are Andy Pavlo’s CMU database courses so influential?

Pavlo’s CMU courses, 15-445 (Introduction to Database Systems) and 15-721 (Advanced Database Systems), have become some of the most widely used free database education resources in the world. Several factors explain their influence: lectures are recorded and posted to YouTube, slides and notes are published openly, and the content covers both foundational concepts and cutting-edge research. Pavlo’s teaching style combines technical rigor with engaging delivery, making complex topics like multi-version concurrency control, query optimization, and storage engine design genuinely accessible. The courses have been viewed millions of times and are used as reference material by engineers at major technology companies worldwide, filling a gap that textbooks alone could not address.

What is the Database of Databases, and why did Pavlo create it?

The Database of Databases (dbdb.io) is an online encyclopedia that catalogs over 1,000 database management systems, maintained by Pavlo and the CMU Database Group. Each entry includes structured information about the system’s architecture, data model, query language, storage engine, concurrency control method, and other technical characteristics. Pavlo created DBDB because the database landscape has become extraordinarily fragmented — with hundreds of new systems appearing each year — and there was no comprehensive, neutral resource for comparing and understanding them. The project serves researchers evaluating related work, students learning about database design tradeoffs, practitioners selecting databases for new projects, and historians tracking the evolution of data management technology. It reflects Pavlo’s broader mission of making database knowledge systematically accessible to everyone.
