Charity Majors: The Observability Pioneer Who Taught Engineers to Understand Production

When a distributed system fails at two in the morning, the first question is always the same: what changed? For decades, engineers answered that question by searching through logs, scanning dashboards of pre-aggregated metrics, and hoping that someone had anticipated the failure mode well enough to build a monitor for it. Then Charity Majors came along and argued that the entire approach was fundamentally broken. The problem, she insisted, was not that engineers lacked data — it was that they lacked the ability to ask arbitrary questions of their data in real time. Metrics and logs told you what you had already decided to measure. What you needed was the ability to explore the unknown unknowns — the failure modes nobody anticipated, the interactions between services that nobody predicted, the latency spikes that no dashboard was designed to catch. This insight — that production systems require exploratory, high-cardinality analysis rather than pre-defined dashboards — became the foundation of the modern observability movement. Majors co-founded Honeycomb.io in 2016 to build the tooling that would make this vision practical, and in doing so, she changed how an entire generation of engineers thinks about understanding production software.

Early Career: From Systems Administration to Facebook Scale

Charity Majors grew up in rural Idaho and developed an early fascination with computers. She started her career as a systems administrator and infrastructure engineer, working her way through a series of companies where she developed a deep, hands-on understanding of how production systems actually behave. She was not an academic theorist — she was an operator, someone who carried a pager and got woken up when things broke. This operational perspective would inform everything she later built and advocated for.

Her career trajectory accelerated when she joined Parse, a mobile backend-as-a-service platform that let developers build mobile applications without managing their own servers. Parse was running a massive MongoDB infrastructure, handling hundreds of thousands of requests per second from mobile apps around the world. Managing that infrastructure — keeping it reliable, understanding its failure modes, diagnosing performance problems across a complex distributed stack — gave Majors a masterclass in the limitations of existing monitoring tools.

In 2013, Facebook acquired Parse, and Majors moved to Facebook’s infrastructure team. At Facebook scale, the problems she had encountered at Parse were amplified by orders of magnitude. Facebook was operating one of the largest distributed systems on the planet, with billions of requests per day flowing through thousands of interconnected services. Traditional monitoring — collecting metrics, building dashboards, setting threshold alerts — simply could not keep pace with the complexity. When something went wrong in a system that large, the failure mode was almost always novel. It was an interaction between components that nobody had predicted, a cascade of effects that no pre-built dashboard could capture. Engineers needed to be able to ask new questions of their data on the fly, slicing and dicing across dozens of dimensions simultaneously. The existing tools — even in the DevOps-aware era — were not built for that kind of exploration.

The Founding of Honeycomb

The Core Problem

When Majors left Facebook in 2016 to co-found Honeycomb with Christine Yen, she had a clear diagnosis of what was wrong with the monitoring industry. The traditional approach, built around the so-called “three pillars” of observability (metrics, logs, and traces), a framing Majors came to criticize sharply, treated each data type as a separate silo. Metrics went to one system, logs to another, traces to a third. Each system had its own query language, its own retention policies, its own limitations. Worse, all three systems required engineers to decide in advance what data to collect and how to aggregate it. You could not ask a question that crossed the boundaries between these systems, and you could not ask a question that nobody had anticipated.

Majors argued for a fundamentally different model: structured events. Instead of splitting observability data into three separate types, Honeycomb treated every request, every database query, every function call as a single structured event — a wide, denormalized record containing every piece of context that might be relevant. A single event might contain hundreds of fields: the user ID, the endpoint, the response time, the database query count, the deployment version, the feature flags enabled, the geographic region, the device type, and anything else the engineer decided to attach. By storing these rich, wide events and providing a query engine that could slice across any combination of fields in real time, Honeycomb enabled the kind of exploratory analysis that Majors had been craving throughout her career.

The Technical Architecture

Honeycomb’s technical design reflected Majors’ operational philosophy. The system was built to handle high-cardinality data — fields with millions of unique values, like user IDs or request UUIDs — which traditional metrics systems could not accommodate. It used a columnar storage engine optimized for the kind of GROUP BY and BREAK DOWN queries that characterize exploratory analysis. And it was fast enough to return results interactively, so engineers could iterate on their queries in real time rather than waiting minutes for batch jobs to complete.
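The kind of interactive GROUP BY / break-down analysis described above can be sketched in plain Python, with a list of dictionaries standing in for Honeycomb's columnar store (all field names and numbers here are illustrative, not real query output):

```python
from collections import defaultdict

# A few wide events, one dict per request (illustrative values)
events = [
    {"endpoint": "/api/search", "user_plan": "free",       "duration_ms": 42},
    {"endpoint": "/api/search", "user_plan": "enterprise", "duration_ms": 910},
    {"endpoint": "/api/home",   "user_plan": "free",       "duration_ms": 35},
    {"endpoint": "/api/search", "user_plan": "enterprise", "duration_ms": 880},
]

def break_down(events, field, value="duration_ms"):
    """Toy GROUP BY: worst-case latency per distinct value of `field`."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e[value])
    return {k: max(v) for k, v in groups.items()}

print(break_down(events, "endpoint"))   # the slow endpoint stands out
print(break_down(events, "user_plan"))  # then slice by a different field
```

The point of the real engine is that any of the hundreds of fields on a wide event can serve as `field` here, interactively, over billions of events.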

# Instrumenting a web service with Honeycomb's structured events.
# Instead of separate metrics, logs, and traces, you emit rich, wide
# events with all relevant context. This sketch uses the libhoney
# Python SDK; DEPLOY_SHA, database, cache, build_response, and
# error_response are application-level placeholders.

import time

import libhoney

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="production-api")

def handle_request(request):
    # Start building a structured event — one per request
    event = libhoney.new_event()

    # Attach every piece of context that might matter later.
    # This is the "wide event" approach Charity Majors advocates.
    event.add_field("endpoint", request.path)
    event.add_field("method", request.method)
    event.add_field("user_id", request.user.id)
    event.add_field("user_plan", request.user.plan)        # billing tier
    event.add_field("deploy_version", DEPLOY_SHA)          # which build
    event.add_field("feature_flags", request.active_flags) # A/B tests
    event.add_field("region", request.geo.region)
    event.add_field("device_type", request.device_type)

    start = time.monotonic()

    try:
        # Process the request
        db_start = time.monotonic()
        result = database.query(request.query_params)
        event.add_field("db_duration_ms", (time.monotonic() - db_start) * 1000)
        event.add_field("db_rows_returned", len(result))

        cache_hit = cache.check(request.cache_key)
        event.add_field("cache_hit", cache_hit)

        response = build_response(result)
        event.add_field("status_code", 200)
        event.add_field("response_size_bytes", len(response.body))

    except DatabaseTimeout:
        event.add_field("status_code", 504)
        event.add_field("error", "database_timeout")
        event.add_field("db_query", str(request.query_params))
        response = error_response(504)

    except Exception as e:
        event.add_field("status_code", 500)
        event.add_field("error", type(e).__name__)
        event.add_field("error_message", str(e))
        response = error_response(500)

    finally:
        event.add_field("duration_ms", (time.monotonic() - start) * 1000)
        # Send the event — Honeycomb can now query across ALL these fields:
        # "Show me p99 latency broken down by deploy_version and region"
        # "Find all requests where cache_hit=false AND db_duration > 500ms"
        # "Which user_plan has the highest error rate this hour?"
        event.send()

    return response
This approach was a direct challenge to the way the monitoring industry had operated for years. Companies like Datadog, New Relic, and Splunk had built empires on the “three pillars” model, and their products reflected the assumption that metrics, logs, and traces were fundamentally different kinds of data requiring fundamentally different storage and query systems. Majors argued — loudly, persistently, and with considerable technical credibility — that this architecture was an artifact of historical tooling limitations, not a reflection of how engineers actually needed to work. The distinction she drew was between monitoring (checking known conditions against known thresholds) and observability (exploring unknown conditions in complex systems). Both were necessary, she argued, but the industry had invested heavily in monitoring while neglecting observability.

The Observability Movement

Majors did not just build a product — she built a movement. Through relentless writing, speaking, and social media engagement, she articulated a philosophy of production engineering that went far beyond any single tool. Her central argument was that in the era of containerized, orchestrated microservices, the traditional debugging workflow — reproduce the bug locally, step through the code in a debugger, find the root cause — was no longer viable. Modern production systems exhibited emergent behaviors that could not be reproduced in development environments. The only way to understand these systems was to instrument them thoroughly and then explore the resulting data interactively.

This philosophy had several key tenets. First, you must be able to understand any system state without having predicted it in advance. This meant that pre-defined dashboards and alerts were insufficient — they only caught the failure modes that somebody had anticipated. Second, high-cardinality data was essential. If you could not break down your data by individual user ID, request ID, or deployment version, you could not do real debugging. Third, engineers should own their own code in production — not just write it and throw it over the wall to an operations team. This perspective aligned naturally with the broader DevOps movement and with practices like continuous delivery, but Majors pushed it further, arguing that the ability to observe production was not just an operational concern but a fundamental engineering skill.
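The second tenet, high cardinality, is easy to demonstrate: an aggregate average hides a single affected user, while a filter on the raw user_id field finds them immediately. A toy sketch with invented numbers:

```python
# 100,000 toy events; exactly one user is pathologically slow
events = [
    {"user_id": f"u{i}", "duration_ms": 40 + (900 if i == 31337 else 0)}
    for i in range(100_000)
]

# Pre-aggregated view: the average barely moves, so the problem is invisible
avg = sum(e["duration_ms"] for e in events) / len(events)
print(f"mean latency: {avg:.2f} ms")  # roughly 40 ms, looks healthy

# High-cardinality view: filter on the raw field and find the victim at once
slow = [e for e in events if e["user_id"] == "u31337"]
print(slow[0])  # the single 940 ms outlier
```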

Her influence extended beyond Honeycomb’s customer base. The term “observability” — borrowed from control theory, where it describes the degree to which a system’s internal state can be inferred from its external outputs — became the dominant framing for an entire category of infrastructure tooling. Industry researchers like Nicole Forsgren began incorporating observability practices into their frameworks for measuring engineering effectiveness. Competitors adopted the language and, gradually, the architectural approach. By the early 2020s, “observability” had largely replaced “monitoring” as the preferred term in the infrastructure community, and structured events were widely recognized as a legitimate alternative to the three-pillars model.

Philosophy and Engineering Approach

Testing in Production

One of Majors’ most provocative and influential arguments was her defense of “testing in production.” This phrase, which sounds reckless to anyone trained in traditional software engineering, was deliberately chosen to challenge assumptions. Majors’ argument was not that you should skip testing before deployment — she advocated for comprehensive automated testing, continuous integration and delivery, and careful staging environments. Her argument was that no amount of pre-production testing could fully simulate production conditions. Real users, real data volumes, real network conditions, real interactions between dozens of services — these could only be observed in production itself. Therefore, every deployment to production was, in some sense, a test, and the engineering team needed the tools and practices to detect and respond to problems immediately.

This led to her advocacy for practices like feature flags, canary deployments, progressive rollouts, and — above all — rich instrumentation that let engineers see exactly what was happening during and after each deployment. She popularized the concept of “deploy, observe, iterate” as a replacement for the older “develop, test, deploy” cycle. The key insight was that observation was not a passive activity that happened after deployment but an active, integral part of the deployment process itself.
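A common mechanical underpinning of the canary and progressive-rollout practices mentioned above is deterministic user bucketing. The following is a minimal sketch under that assumption; the function name and percentages are illustrative, not any vendor's API:

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Stable assignment: the same user stays in the cohort as the slice widens."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# Deploy, observe, iterate: widen the slice only after each stage looks healthy
for pct in (1, 5, 25, 100):
    cohort = sum(in_rollout(f"user-{i}", pct) for i in range(10_000))
    print(f"{pct:>3}% rollout -> {cohort} of 10000 users see the new code")
```

Because the hash is stable, a user admitted at 1% remains in the cohort at 5% and beyond, which is what makes the observed behavior at each stage comparable.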

Sociotechnical Perspectives

Majors was also notable for her attention to the human and organizational dimensions of software engineering. She wrote extensively about on-call culture, burnout, team dynamics, and the relationship between tooling and engineering culture. She argued that bad observability tools did not just cause technical problems — they caused human problems. Engineers who could not understand their production systems became anxious, burned out, and defensive. They avoided deploying changes because they feared what might break. They hoarded knowledge because only a few senior engineers understood the system well enough to debug it. Better observability tools, she argued, democratized understanding — they let any engineer, regardless of seniority, explore production data and develop confidence in their ability to understand and fix problems.

# Honeycomb query patterns that Majors advocates for production debugging
# These demonstrate the "exploratory analysis" approach to observability

# Pattern 1: Find the needle in the haystack
# "Why are some users seeing slow responses?"
# Start broad, then narrow down interactively
QUERY: HEATMAP(duration_ms) GROUP BY endpoint
  → spot bimodal distribution on /api/search
  → BREAK DOWN BY user_plan
  → discover "enterprise" plan users are 10x slower
  → BREAK DOWN BY db_rows_returned
  → root cause: enterprise accounts return 50k+ rows unpaginated

# Pattern 2: Deployment verification
# "Did this deploy make things better or worse?"
QUERY: P99(duration_ms) WHERE deploy_version IN (v2.3.1, v2.3.2)
  GROUP BY deploy_version, endpoint
  VISUALIZE: time series, last 2 hours
  → compare latency curves side-by-side per endpoint
  → spot that /api/export regressed by 300ms in v2.3.2

# Pattern 3: Trace-based debugging without traditional traces
# "Show me the full journey of a slow request"
QUERY: * WHERE trace.trace_id = "abc-123-def"
  ORDER BY timestamp ASC
  → every span/event for this request across all services
  → see exactly where time was spent
  → identify the downstream service that caused the bottleneck

# Pattern 4: SLO-based analysis
# "Are we meeting our latency SLO? If not, who is affected?"
QUERY: COUNT WHERE duration_ms > 500
  / COUNT * 100 AS error_budget_burn
  GROUP BY region, user_plan
  HAVING error_budget_burn > 1.0
  → shows which segments are burning error budget fastest
  → actionable: fix the specific region+plan combination
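Pattern 4 can be reproduced outside any vendor tool. Here is a small Python sketch of the same error-budget computation over wide events; the field names and 500 ms SLO come from the pattern above, while the data itself is invented:

```python
from collections import defaultdict

def error_budget_burn(events, slo_ms=500):
    """Percent of requests breaching the latency SLO per (region, plan) segment."""
    totals, breaches = defaultdict(int), defaultdict(int)
    for e in events:
        key = (e["region"], e["user_plan"])
        totals[key] += 1
        breaches[key] += e["duration_ms"] > slo_ms
    return {k: 100.0 * breaches[k] / totals[k] for k in totals}

events = [
    {"region": "us-east", "user_plan": "free",       "duration_ms": 120},
    {"region": "us-east", "user_plan": "enterprise", "duration_ms": 820},
    {"region": "us-east", "user_plan": "enterprise", "duration_ms": 640},
    {"region": "eu-west", "user_plan": "free",       "duration_ms": 90},
]

burn = error_budget_burn(events)
# HAVING error_budget_burn > 1.0: flag only the segments that are hurting
hot = {k: v for k, v in burn.items() if v > 1.0}
print(hot)
```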

Impact on the Industry

Honeycomb, under Majors’ technical leadership as CTO, grew from a two-person startup into a significant force in the infrastructure tooling market. The company raised multiple rounds of venture funding, attracted customers ranging from startups to large enterprises, and built a passionate community of practitioners who adopted the observability-first approach. But Majors’ influence extended well beyond Honeycomb’s direct business impact.

She changed the vocabulary of the industry. The term “observability” moved from an obscure control-theory concept to a mainstream engineering term used in job postings, conference talks, and vendor marketing materials worldwide. She shifted the Overton window on what engineers expected from their production tooling — high-cardinality support, fast interactive queries, and structured events went from being niche requirements to standard expectations. Even competitors who disagreed with her specific architectural choices adopted the language and concepts she popularized.

She also influenced engineering culture. Her writing about on-call practices, deployment anxiety, and the relationship between tooling and team health resonated with a generation of engineers who felt that the industry’s operational practices had not kept pace with its architectural ambitions. You cannot build high-performance microservices and then debug them with the same tools and processes you used for monolithic applications. Majors articulated this mismatch clearly and offered both philosophical and practical solutions. Her advice to engineering teams — instrument aggressively, deploy frequently, observe carefully, and learn from production — became a widely adopted playbook for modern software operations.

Her public presence — through blog posts, conference keynotes, Twitter threads, and her newsletter — made her one of the most visible and influential voices in infrastructure engineering. She was known for direct, sometimes blunt communication, for a willingness to challenge industry orthodoxy, and for an engineering philosophy rooted in operational reality rather than theoretical elegance. She earned the respect of both practitioners and industry leaders by consistently backing up her opinions with deep technical knowledge and real-world experience.

The Broader Vision: Observability as Engineering Practice

Majors’ vision for observability extended beyond any single tool or technique. She argued that observability was not a product category but a property of a system — the degree to which you could understand the system’s internal state by examining its outputs. A system was observable not because you had installed a particular vendor’s product, but because you had instrumented it thoroughly, stored the right data, and built the organizational practices to exploit that data effectively.

This framing had profound implications. It meant that observability was an engineering discipline, not a purchasing decision. It required investment in instrumentation, in data literacy, in a culture of curiosity and exploration. It meant that the best observability setup in the world was useless without engineers who knew how to ask the right questions. And it meant that observability was not a one-time project but an ongoing practice — as systems evolved, the instrumentation had to evolve with them.

Majors connected this vision to the broader arc of software engineering maturity. Just as the industry had gradually adopted version control, automated testing, containerization, and continuous delivery, she argued that observability would become a non-negotiable baseline practice — something that every serious engineering team did as a matter of course. The trajectory of the industry in the years since she founded Honeycomb has largely vindicated this prediction. Observability has moved from the bleeding edge to the mainstream, and the structured-event approach that Honeycomb pioneered has become a widely accepted model.

Key Facts

  • Full name: Charity Majors
  • Role: Co-founder and CTO of Honeycomb.io
  • Known for: Pioneering the modern observability movement, advocating for structured events over the “three pillars” model, popularizing testing-in-production practices
  • Previous experience: Infrastructure engineer at Parse and Facebook, extensive background in systems administration
  • Company founded: 2016, co-founded Honeycomb with Christine Yen
  • Key contribution: Reframing production debugging from pre-defined monitoring to exploratory, high-cardinality analysis
  • Philosophy: “Observability is not about dashboards — it is about the ability to ask arbitrary questions of your systems without having to predict them in advance”
  • Influence: Shifted industry vocabulary from “monitoring” to “observability,” influenced engineering culture around on-call practices, deployment confidence, and production ownership

Frequently Asked Questions

Who is Charity Majors?

Charity Majors is the co-founder and CTO of Honeycomb.io, a company that builds observability tools for software engineering teams. She is widely recognized as one of the leading voices in the modern observability movement. Before founding Honeycomb, she worked as an infrastructure engineer at Parse and later at Facebook (which acquired Parse in 2013), where she managed large-scale distributed systems. Her experience operating complex production infrastructure led her to identify fundamental limitations in traditional monitoring tools and to develop the structured-event approach to observability that Honeycomb implements.

What is observability and how does it differ from monitoring?

Monitoring is the practice of collecting pre-defined metrics and checking them against known thresholds — for example, alerting when CPU usage exceeds 90% or when error rates cross a certain threshold. Observability, as Majors defines it, is the ability to understand a system’s internal state by examining its outputs, without having to predict in advance what questions you will need to ask. In practical terms, monitoring tells you when something known is broken, while observability lets you explore why something unknown is happening. Majors argues that in complex distributed systems, novel failure modes are the norm rather than the exception, making exploratory observability essential for effective debugging.
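The contrast can be made concrete in a few lines: monitoring is a fixed predicate decided ahead of time, while observability composes new predicates at debug time over raw events. A toy illustration, not any vendor's API:

```python
# Monitoring: a check someone wrote in advance for a known failure mode
def cpu_alert(cpu_percent: float) -> bool:
    return cpu_percent > 90.0

# Observability: an arbitrary question composed on the fly over raw events
def explore(events, **conditions):
    return [e for e in events
            if all(e.get(k) == v for k, v in conditions.items())]

events = [
    {"endpoint": "/api/search", "cache_hit": False, "status_code": 504},
    {"endpoint": "/api/home",   "cache_hit": True,  "status_code": 200},
]

# A question nobody predicted when the dashboards were built:
misses = explore(events, endpoint="/api/search", cache_hit=False)
print(misses)
```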

What is a structured event in the context of Honeycomb?

A structured event is a single, wide, denormalized data record that captures all the context about a specific action in a system — typically one HTTP request, one database query, or one function invocation. Unlike traditional approaches that split data into separate metrics, logs, and traces, a structured event combines all relevant fields into one record. A single event might include the endpoint, response time, user ID, deployment version, feature flags, database query count, cache hit status, and dozens of other fields. This wide format allows engineers to query across any combination of fields in real time, enabling the exploratory analysis that defines observability.

What does Charity Majors mean by testing in production?

Testing in production does not mean skipping pre-production testing. Majors advocates for comprehensive automated testing, continuous integration, and staging environments. Her argument is that no pre-production environment can fully replicate real production conditions — real user behavior, real data volumes, real network conditions, and real interactions between services. Therefore, every production deployment is inherently a kind of test, and engineering teams need the instrumentation and practices to observe what happens during and after each deployment. This includes feature flags, canary deployments, progressive rollouts, and rich observability tooling that makes it safe to deploy frequently and detect problems immediately.

Why does high-cardinality data matter for observability?

High-cardinality data refers to fields with many unique values — such as user IDs, request IDs, session tokens, or deployment versions. Traditional metrics systems struggle with high-cardinality data because they pre-aggregate values, and pre-aggregation becomes impractical when a field has millions of unique values. Majors argues that high-cardinality fields are precisely the ones you need for real debugging. When a single user reports a problem, you need to filter by their user ID. When a specific deployment causes a regression, you need to group by deployment version. Without high-cardinality support, engineers are forced to debug blind, relying on aggregated averages that hide the specific conditions causing problems.
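The scaling problem is simple arithmetic: a metrics system stores one time series per unique combination of label values, so a handful of high-cardinality labels multiplies into an unmanageable number of series. The counts below are illustrative:

```python
# One pre-aggregated time series per unique label combination
users, endpoints, versions = 1_000_000, 50, 20
series = users * endpoints * versions
print(f"{series:,} time series")  # one billion; infeasible to pre-aggregate

# The event model instead stores one row per request and pays the
# aggregation cost only at query time, over whatever fields you ask about.
```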

How has Charity Majors influenced engineering culture?

Beyond her technical contributions, Majors has significantly influenced how engineering organizations think about production ownership, on-call practices, and the relationship between tooling and team health. She argues that engineers who cannot understand their production systems become anxious and avoidant — they fear deployments, hoard knowledge, and burn out. Better observability tools democratize understanding, allowing any team member to explore production data confidently. She advocates for engineers owning their code in production rather than delegating operational responsibility to separate teams, and she connects observability practices to broader engineering effectiveness metrics that organizations use to measure team performance and software delivery capability.