In 2017, at a packed SREcon Americas session, an engineer from New Relic stood at the podium and said something that made half the audience uncomfortable: the way most companies practiced Site Reliability Engineering was broken — not because the tooling was immature or the math behind error budgets was wrong, but because organizations treated operations work as invisible labor performed by invisible people. The engineer was Alice Goldfuss, and over the following years she would become one of the most influential voices in infrastructure engineering — not by inventing a new orchestration framework or founding a startup, but by relentlessly articulating what it actually means to run production systems at scale and why the humans doing that work deserve better than they typically get.
Goldfuss occupies a distinctive position in the history of modern infrastructure. She bridges the gap between the deeply technical world of containers, deployment pipelines, and distributed systems observability, and the organizational and cultural questions that determine whether those systems actually work in practice. Her writing, conference talks, and open-source contributions have shaped how an entire generation of engineers thinks about on-call rotations, incident response, operational toil, and the career trajectories of the people who keep the internet running. In an industry obsessed with building new things, Goldfuss has championed the unglamorous, essential work of keeping existing things alive.
Early Life and Path into Engineering
Alice Goldfuss grew up in the United States during the era when the internet was transitioning from an academic curiosity to the defining infrastructure of modern life. She came to technology through a combination of natural curiosity and the kind of hands-on experimentation that characterized many engineers of her generation — tinkering with computers, exploring early online communities, and gradually discovering that the systems behind the screen were at least as interesting as anything displayed on it.
Her formal education provided a foundation in computer science fundamentals, but like many of the most effective infrastructure engineers, Goldfuss’s real education came through operational experience. The gap between what computer science programs teach and what production systems demand is enormous. University courses cover algorithms, data structures, and software design patterns. They rarely cover what happens when a database failover triggers a cascade of connection pool exhaustion across fifty microservices at 3:00 AM on a Saturday, or how to coordinate an incident response across three time zones while customer-facing dashboards show degraded performance. Goldfuss would spend her career working in that gap — and eventually, helping others navigate it.
Her early career positions exposed her to the full spectrum of operational reality. She worked with systems that ranged from legacy monoliths running on bare metal to early containerized architectures that were still finding their footing. Each environment taught different lessons. Legacy systems taught the value of institutional knowledge and the danger of undocumented tribal wisdom. Newer architectures taught that complexity does not disappear when you decompose a monolith — it merely redistributes itself across network boundaries, making failures harder to predict and debug. These experiences gave Goldfuss something that many pure-software engineers lack: an intuitive understanding of how systems fail in production, not just how they succeed in development.
Career at New Relic and GitHub
Building Observability Infrastructure at New Relic
Goldfuss’s tenure at New Relic placed her at the intersection of two critical trends: the explosion of microservices architectures and the growing demand for sophisticated observability tooling. New Relic, as a major observability platform, was both building the tools that other companies used to monitor their systems and simultaneously managing the operational complexity of running those tools at scale. This created a fascinating recursive challenge: the infrastructure that monitored infrastructure had to be extraordinarily reliable, because when your monitoring goes down, you lose visibility into everything else.
At New Relic, Goldfuss worked on the kind of infrastructure problems that reveal the true nature of distributed systems engineering. She dealt with the operational realities of running large-scale data ingestion pipelines: systems that had to process billions of metric data points, traces, and events with low latency and high availability. This was not theoretical distributed systems work; it was the daily practice of keeping complex, interdependent services running while the underlying requirements constantly shifted. The experience sharpened her understanding of what Patrick Debois had advocated when he coined the term DevOps: that operational excellence requires deep collaboration between the people who build systems and the people who run them.
Her work at New Relic also exposed her to the organizational dynamics that would become central to her public advocacy. She saw how observability teams were often understaffed relative to the complexity they managed, how on-call burdens fell unevenly on specific individuals, and how the promotion structures in many engineering organizations systematically undervalued operational work compared to feature development. These observations would fuel some of her most impactful conference talks and writing.
Infrastructure Engineering at GitHub
Goldfuss’s move to GitHub represented a step into one of the most consequential infrastructure environments in the software industry. GitHub is not merely a code hosting platform — it is critical infrastructure for millions of developers and thousands of organizations worldwide. When GitHub experiences an outage, software development across the entire industry slows or stops. The operational stakes at GitHub are extraordinarily high, and the complexity of the platform — spanning Git’s distributed architecture, massive-scale storage systems, CI/CD pipelines, and a globally distributed user base — demands infrastructure engineering of the highest caliber.
At GitHub, Goldfuss worked on container infrastructure and deployment systems, tackling the challenge of running one of the world’s largest Ruby on Rails applications alongside an increasingly complex ecosystem of supporting services. The scale of GitHub’s operations — hundreds of millions of repositories, billions of Git objects, millions of daily API requests — created infrastructure challenges that few organizations ever encounter. Managing container orchestration at this scale required not just technical skill but the kind of operational judgment that comes from years of experience with production systems. This is the kind of environment where the principles championed by engineers like Mitchell Hashimoto with Terraform and infrastructure-as-code become not optional best practices but absolute necessities for survival.
A simplified example of the kind of container health checking and automated remediation that Goldfuss and her team had to manage at scale might look like this:
```python
#!/usr/bin/env python3
"""
Container fleet health monitor: simplified example
of the kind of automated remediation logic required
when managing thousands of containers in production.
Real-world systems like those at GitHub operate at
far greater complexity with distributed coordination.
"""
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

logger = logging.getLogger("fleet_health")


class ContainerState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    DRAINING = "draining"


@dataclass
class HealthCheckResult:
    container_id: str
    state: ContainerState
    latency_p99_ms: float
    error_rate: float
    memory_usage_pct: float
    last_deploy_sha: str
    consecutive_failures: int = 0


@dataclass
class FleetHealthMonitor:
    """Monitors container fleet and triggers remediation
    when health thresholds are breached."""
    containers: Dict[str, HealthCheckResult] = field(
        default_factory=dict
    )
    error_rate_threshold: float = 0.05
    latency_threshold_ms: float = 500.0
    memory_threshold_pct: float = 85.0
    max_consecutive_failures: int = 3

    def evaluate_container(
        self, result: HealthCheckResult
    ) -> ContainerState:
        """Determine container state from health metrics."""
        if result.error_rate > self.error_rate_threshold:
            return ContainerState.UNHEALTHY
        if result.latency_p99_ms > self.latency_threshold_ms:
            return ContainerState.DEGRADED
        if result.memory_usage_pct > self.memory_threshold_pct:
            return ContainerState.DEGRADED
        return ContainerState.HEALTHY

    def process_health_check(
        self, result: HealthCheckResult
    ) -> Optional[str]:
        """Process a health check and return remediation
        action if needed."""
        state = self.evaluate_container(result)
        result.state = state
        # Carry the failure streak over from the previous check.
        # Each incoming result starts with its own counter at 0,
        # so without this the threshold could never be reached.
        previous = self.containers.get(result.container_id)
        prior_failures = (
            previous.consecutive_failures if previous else 0
        )
        self.containers[result.container_id] = result
        if state == ContainerState.UNHEALTHY:
            result.consecutive_failures = prior_failures + 1
            if (result.consecutive_failures
                    >= self.max_consecutive_failures):
                logger.warning(
                    "Container %s hit failure threshold "
                    "(%d consecutive). Draining and replacing.",
                    result.container_id,
                    result.consecutive_failures,
                )
                return self._initiate_replacement(
                    result.container_id
                )
            return "alert_on_call"
        # Reset failure counter on any check that is not unhealthy
        result.consecutive_failures = 0
        return None

    def _initiate_replacement(
        self, container_id: str
    ) -> str:
        """Drain traffic and schedule container replacement.
        In production this coordinates with the load balancer,
        waits for in-flight requests, then provisions a
        fresh container from the latest healthy image."""
        self.containers[container_id].state = (
            ContainerState.DRAINING
        )
        logger.info(
            "Draining %s; shifting traffic to healthy peers",
            container_id,
        )
        return "drain_and_replace"

    def fleet_summary(self) -> Dict[str, int]:
        """Return count of containers in each state."""
        summary: Dict[str, int] = {}
        for c in self.containers.values():
            key = c.state.value
            summary[key] = summary.get(key, 0) + 1
        return summary
```
This kind of automated health monitoring and remediation — evaluating container state, tracking consecutive failures, and initiating graceful replacements — represents just one layer of the operational machinery required to keep a platform like GitHub running. Goldfuss brought both the technical expertise to work on these systems and the communication skills to explain why this work mattered to the broader engineering community.
Contributions to Operations Culture
Redefining On-Call and Incident Response
One of Goldfuss’s most significant contributions has been her work articulating what healthy on-call practices look like — and, more importantly, what unhealthy ones cost. On-call duty is one of the defining experiences of infrastructure engineering. It means carrying a pager (now typically a phone notification) and being responsible for responding to production incidents at any hour of the day or night. Done well, on-call is a manageable responsibility shared equitably across a team with proper tooling, runbooks, and escalation paths. Done poorly — which is far more common — it becomes a source of burnout, sleep deprivation, and attrition that disproportionately affects the most experienced and most conscientious engineers.
Goldfuss has been vocal about the organizational failures that create toxic on-call environments: rotations that are too thin because teams are understaffed, alert fatigue caused by poorly tuned monitoring that pages engineers for non-actionable events, lack of follow-through on incident postmortems that means the same problems page people repeatedly, and management structures that treat on-call as a cost center rather than a critical capability. Her talks on these subjects have influenced how companies structure their engineering organizations, particularly around the balance between feature development and operational investment.
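Two of the failure modes named above, alert fatigue and unevenly distributed paging, can be made measurable from paging history. The following is a minimal sketch, assuming a hypothetical `Page` record whose `actionable` flag is filled in during review; the schema and function names are illustrative and do not come from Goldfuss's talks or from any specific paging tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Page:
    """One page sent to an on-call engineer (hypothetical schema)."""
    engineer: str
    alert_name: str
    actionable: bool  # did a human actually need to do something?


def non_actionable_ratio(pages: List[Page]) -> float:
    """Fraction of pages that required no human action."""
    if not pages:
        return 0.0
    noise = sum(1 for p in pages if not p.actionable)
    return noise / len(pages)


def pages_per_engineer(pages: List[Page]) -> Dict[str, int]:
    """Count pages received by each engineer."""
    return dict(Counter(p.engineer for p in pages))


def noisiest_alerts(pages: List[Page], top: int = 3) -> List[str]:
    """Alert names that most often paged without being actionable."""
    noise = Counter(p.alert_name for p in pages if not p.actionable)
    return [name for name, _ in noise.most_common(top)]


if __name__ == "__main__":
    week = [
        Page("ana", "disk_80pct", False),
        Page("ana", "disk_80pct", False),
        Page("ana", "api_5xx", True),
        Page("ben", "disk_80pct", False),
    ]
    print(non_actionable_ratio(week))  # 0.75
    print(pages_per_engineer(week))    # {'ana': 3, 'ben': 1}
    print(noisiest_alerts(week))       # ['disk_80pct']
```

A high non-actionable ratio points at monitoring that needs tuning, while a skewed per-engineer count points at a rotation that is too thin or unevenly shared.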
Her advocacy for blameless postmortems — a practice championed by Google’s SRE methodology and refined by practitioners across the industry — extended beyond the mechanical process of writing incident reports. Goldfuss emphasized that blameless postmortems only work in organizations where psychological safety is genuine, not performative. If engineers fear that honest analysis of an incident will lead to punishment, they will sanitize their postmortems, and the organization will lose its most valuable source of learning. This insight connected operational practice to organizational culture in ways that resonated far beyond the SRE community.
The Visibility of Operational Work
Perhaps Goldfuss’s most distinctive contribution is her sustained argument that operational work suffers from a visibility problem that has real consequences for the people who do it. In most engineering organizations, the promotion criteria and performance review systems are optimized to recognize work that creates something new: a new feature, a new service, a new product. Operational work — keeping existing systems running, reducing toil, improving reliability, responding to incidents — is by its nature less visible. When operations work is done well, nothing happens. The absence of failure is the signal of success, and absences are inherently hard to measure and reward.
This visibility asymmetry creates perverse incentives. Engineers who want to advance their careers are implicitly encouraged to build new things rather than maintain existing ones. The result is a steady accumulation of operational debt: systems that work but are fragile, monitoring that exists but is incomplete, runbooks that were written once and never updated, deployment processes that rely on one person’s tribal knowledge. Goldfuss has argued consistently that this pattern is not just unfair to operations engineers — it is actively harmful to the organizations that allow it to persist. The teams building features depend entirely on the infrastructure that operations engineers maintain, and underinvesting in that infrastructure is a form of organizational self-harm.
This argument parallels the broader conversation about infrastructure investment that engineers like Jeff Dean at Google have championed from the systems architecture side. Where Dean demonstrated the technical value of investing in distributed infrastructure, Goldfuss has articulated the human and organizational value of investing in the people who run that infrastructure.
Advocacy for Diversity in Infrastructure Engineering
Goldfuss has been a visible advocate for diversity and inclusion in infrastructure engineering, a subfield that has historically been even less diverse than the broader software industry. Her advocacy has been characteristically practical rather than abstract — focused on specific, actionable changes that organizations can make to their hiring practices, on-call policies, conference speaking opportunities, and mentorship programs.
She has spoken openly about the challenges of being a woman in infrastructure engineering, a field where the culture has traditionally been shaped by assumptions about who belongs in a server room or on an on-call rotation. Her willingness to discuss these challenges publicly has made it easier for other underrepresented engineers to enter and remain in the field. This kind of representation matters: when aspiring infrastructure engineers see someone who looks like them speaking at SREcon about container orchestration or incident command, it expands their sense of what is possible for their own careers.
Technical Philosophy
Operations as Engineering, Not Janitorial Work
Central to Goldfuss’s philosophy is the insistence that operations is a legitimate engineering discipline — not a lesser form of software development, not a support function, and certainly not janitorial work. This distinction matters because the language organizations use to describe operational work shapes how they invest in it. When operations is framed as “keeping the lights on,” it becomes a cost to be minimized rather than a capability to be developed. When it is framed as engineering — with its own body of knowledge, its own design patterns, its own career ladder — it becomes something organizations can invest in strategically.
This philosophical position has practical implications for everything from hiring to architecture. If operations is engineering, then operations teams need the same investment in tooling, training, and career development that product engineering teams receive. If operations is engineering, then operational concerns — reliability, observability, incident response — should be first-class considerations in system design, not afterthoughts bolted on after deployment. This aligns with the SRE model that Google popularized, but Goldfuss has pushed the argument further by addressing the cultural and organizational dimensions that Google’s original SRE book sometimes treated as secondary.
The Importance of Writing and Communication
Goldfuss is an exceptionally skilled technical writer, and she has consistently argued that writing is an undervalued skill in infrastructure engineering. Her blog posts combine technical depth with clarity and wit, making complex operational concepts accessible to engineers who might not have direct experience with them. She has written about topics ranging from the mechanics of container networking to the psychology of incident response, always with an emphasis on practical insight over theoretical abstraction.
Her approach to technical communication reflects a deeper conviction that the biggest problems in infrastructure are not purely technical — they are problems of coordination, communication, and shared understanding. The most sophisticated monitoring system in the world is useless if the engineers who need to act on its alerts do not understand what they mean. The most elegant deployment pipeline is worthless if the team does not have shared mental models of how it works and what to do when it breaks. Writing — clear, honest, detailed writing — is the primary tool for building those shared mental models.
An example of the kind of structured operational documentation Goldfuss advocates — incident runbooks that encode operational knowledge into actionable, executable form — might look like this:
```yaml
# Incident Runbook: Database Connection Pool Exhaustion
# This structured format ensures on-call engineers
# have actionable steps, not just tribal knowledge.
# Goldfuss advocates making runbooks executable and testable.
runbook:
  title: "Database Connection Pool Exhaustion"
  severity: P1
  last_reviewed: "2025-06-20"
  owner: "infrastructure-team"
  symptoms:
    - "Application latency spike above 2s p99"
    - "Connection timeout errors in service logs"
    - "Database active connections at or near max_connections"
    - "Upstream services returning 503 errors"
  diagnosis:
    steps:
      - description: "Check current connection count"
        # GROUP BY is required because max_conn is selected
        # alongside the count(*) aggregate.
        command: |
          SELECT count(*) AS active,
                 max_conn AS maximum,
                 round(count(*)::numeric / max_conn * 100, 1)
                   AS utilization_pct
          FROM pg_stat_activity,
               (SELECT setting::int AS max_conn
                FROM pg_settings
                WHERE name = 'max_connections') mc
          WHERE state != 'idle'
          GROUP BY max_conn;
        expected: "Utilization below 80%"
        if_exceeded: "Proceed to mitigation"
      - description: "Identify connection consumers"
        command: |
          SELECT application_name,
                 client_addr,
                 count(*) AS conn_count,
                 count(*) FILTER (
                   WHERE state = 'active'
                 ) AS active_queries
          FROM pg_stat_activity
          GROUP BY application_name, client_addr
          ORDER BY conn_count DESC
          LIMIT 10;
        note: "Look for a single service consuming
          disproportionate connections"
      - description: "Check for long-running queries"
        command: |
          SELECT pid, now() - pg_stat_activity.query_start
                   AS duration,
                 query, state
          FROM pg_stat_activity
          WHERE (now() - pg_stat_activity.query_start)
                > interval '5 minutes'
          ORDER BY duration DESC;
  mitigation:
    immediate:
      - action: "Terminate idle connections from top consumer"
        command: |
          SELECT pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE application_name = '{offending_app}'
            AND state = 'idle'
            AND query_start < now() - interval '10 min';
        caution: "Only terminate idle connections.
          Active queries may be doing legitimate work."
      - action: "Scale up connection pool on affected service"
        note: "Increase pool_size in service config and
          restart pods in rolling fashion"
    long_term:
      - "Implement connection pooler (PgBouncer)
        between services and database"
      - "Add per-service connection limits to prevent
        single-service monopolization"
      - "Set up alerting on connection utilization
        at 70% threshold"
  communication:
    - threshold: "P1 declared"
      action: "Post in #incidents Slack channel"
    - threshold: "Customer impact confirmed"
      action: "Update status page, notify support team"
    - threshold: "Resolved"
      action: "Post timeline, schedule postmortem
        within 48 hours"
```
This approach to operational documentation — structured, version-controlled, executable, and regularly reviewed — embodies Goldfuss's philosophy that operational knowledge should be encoded in systems, not trapped in individual engineers' heads. When the on-call engineer at 3:00 AM is not the person who originally designed the system, the quality of the runbook is the difference between a fifteen-minute resolution and a four-hour investigation.
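The "regularly reviewed" and "executable" properties can themselves be checked automatically. The following is a minimal sketch of a runbook linter, assuming the runbook has already been parsed into a Python dictionary (for example by a YAML library); the required-field list, the 180-day review window, and the function name `lint_runbook` are assumptions made for illustration, not an established standard.

```python
from datetime import date
from typing import Any, Dict, List

# Fields every runbook is expected to carry (assumed policy).
REQUIRED_KEYS = ("title", "severity", "owner", "last_reviewed",
                 "symptoms", "diagnosis", "mitigation")
MAX_REVIEW_AGE_DAYS = 180  # assumed review window


def lint_runbook(runbook: Dict[str, Any], today: date) -> List[str]:
    """Return human-readable problems; an empty list means it passed."""
    problems = [f"missing field: {key}"
                for key in REQUIRED_KEYS if key not in runbook]

    reviewed = runbook.get("last_reviewed")
    if reviewed is not None:
        try:
            age = (today - date.fromisoformat(reviewed)).days
        except (TypeError, ValueError):
            problems.append(
                f"last_reviewed is not an ISO date string: {reviewed!r}"
            )
        else:
            if age > MAX_REVIEW_AGE_DAYS:
                problems.append(f"stale: last reviewed {age} days ago")

    # Every diagnosis step should carry something an engineer can run.
    steps = runbook.get("diagnosis", {}).get("steps", [])
    for i, step in enumerate(steps, start=1):
        if "command" not in step:
            problems.append(f"diagnosis step {i} has no command")
    return problems


if __name__ == "__main__":
    example = {
        "title": "Database Connection Pool Exhaustion",
        "severity": "P1",
        "owner": "infrastructure-team",
        "last_reviewed": "2025-06-20",
        "symptoms": ["Connection timeout errors in service logs"],
        "diagnosis": {"steps": [
            {"description": "Check current connection count"},
        ]},
        "mitigation": {"immediate": []},
    }
    for problem in lint_runbook(example, today=date(2026, 1, 1)):
        print(problem)
```

Run in continuous integration against a version-controlled runbook directory, a check like this would flag stale or incomplete runbooks before an incident does.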
Conference Speaking and Community Impact
Goldfuss is a sought-after conference speaker whose presentations at events like SREcon, KubeCon, Monitorama, and DevOpsDays have reached thousands of engineers. Her talks are distinctive for their combination of technical rigor, practical relevance, and willingness to address the human dimensions of infrastructure work that many technical speakers avoid.
Her SREcon presentations on topics like operational visibility, on-call health, and the organizational dynamics of infrastructure teams have been particularly influential. These talks do not just describe problems — they offer specific, implementable recommendations that engineering leaders can take back to their organizations. This practical orientation is part of what makes Goldfuss's work so impactful: she speaks to the realities of working engineers, not to abstract ideals.
Beyond formal conference talks, Goldfuss has contributed to the infrastructure engineering community through her active presence on social media, where her commentary on operational practices, industry trends, and the culture of engineering organizations reaches a large and engaged audience. She has been a mentor to many engineers entering the infrastructure field, particularly those from underrepresented backgrounds who might otherwise lack access to the informal networks and institutional knowledge that career advancement in this field often requires. This mentorship work, like the operational work she advocates for, is largely invisible but enormously impactful.
Influence on Modern SRE Practice
Goldfuss's influence on modern SRE practice extends far beyond her individual contributions. She is part of a generation of infrastructure engineers who have collectively redefined what it means to run production systems — moving the discipline from ad-hoc system administration toward a rigorous engineering practice with its own principles, patterns, and career paths.
Her emphasis on operational culture has influenced how companies structure their SRE and platform engineering teams. The growing recognition that on-call health, incident response maturity, and operational toil reduction are first-class engineering objectives — not afterthoughts — owes something to Goldfuss and the community of practitioners she has helped build. When a startup today designs its on-call rotation with explicit attention to equitable distribution, sustainable paging volumes, and meaningful postmortems, it is drawing on ideas that Goldfuss has articulated and championed.
Her work also connects to the broader evolution of cloud-native infrastructure. As organizations have adopted container technologies and modern frameworks, the operational complexity has increased dramatically. The move from managing a handful of servers to orchestrating thousands of containers across multiple cloud regions has made the kind of operational discipline Goldfuss advocates not just desirable but essential. The principles she has championed — automation over manual intervention, documentation over tribal knowledge, equitable on-call over heroic firefighting — are prerequisites for operating successfully at cloud-native scale.
Her advocacy for the people who do operational work — recognizing their contributions, investing in their careers, protecting their well-being — has contributed to a cultural shift in how the industry values infrastructure engineering. The emergence of dedicated platform engineering roles, the formalization of SRE career ladders, and the growing attention to on-call health and engineer burnout all reflect the arguments that Goldfuss and others in her community have been making for years.
Legacy and Ongoing Impact
Alice Goldfuss's legacy in the technology industry is defined by a consistent thread: the insistence that running production systems is real engineering that deserves real investment — in tooling, in process, and above all in the people who do the work. In an industry that celebrates founders, framework authors, and keynote speakers, Goldfuss has championed the engineers whose work is measured by the absence of failure rather than the presence of novelty.
Her influence is visible in the SRE community's growing sophistication around operational culture, in the language that engineering organizations use to describe and value infrastructure work, and in the career paths of the many engineers she has mentored and inspired. She has demonstrated that technical excellence and human advocacy are not competing priorities but complementary ones — that the best infrastructure organizations are those that invest in both their systems and their people.
As the industry continues to evolve — as infrastructure becomes more complex, as the stakes of operational failures grow higher, as the demands on infrastructure engineers intensify — the ideas that Goldfuss has championed will only become more relevant. The question is not whether organizations will adopt the principles she advocates but how quickly. Those that invest in operational culture, treat their infrastructure engineers as first-class contributors, and take the human dimensions of production operations seriously will build more reliable systems, retain better talent, and ultimately deliver more value. That is the argument Alice Goldfuss has been making for her entire career, and the industry is slowly, steadily proving her right.
Frequently Asked Questions
Who is Alice Goldfuss and what is she known for?
Alice Goldfuss is a Site Reliability Engineer and infrastructure engineering advocate known for her influential work at companies like New Relic and GitHub. She is recognized as one of the most articulate voices in modern operations culture, focusing on the human and organizational dimensions of running production systems at scale — including on-call practices, incident response, operational toil, and the career development of infrastructure engineers.
What is Site Reliability Engineering (SRE)?
Site Reliability Engineering is a discipline that applies software engineering principles to infrastructure and operations work. Originally developed at Google, SRE focuses on building automated systems for managing infrastructure, defining reliability targets through Service Level Objectives (SLOs), and balancing feature development velocity against system stability. Goldfuss has been influential in advocating for the cultural and organizational aspects of SRE beyond the purely technical practices.
What are Alice Goldfuss's main contributions to the tech industry?
Goldfuss's contributions span technical infrastructure work at major platforms and broad community advocacy. Her most significant contributions include her work on container infrastructure and deployment systems at GitHub, her influential conference talks on operational culture and on-call health at events like SREcon and Monitorama, her technical writing on infrastructure engineering topics, and her advocacy for making operational work visible and valued within engineering organizations.
Why is operational culture important in software engineering?
Operational culture determines how organizations handle the ongoing work of running production systems — including incident response, monitoring, deployment, and on-call rotations. Poor operational culture leads to engineer burnout, unreliable systems, and high attrition among infrastructure staff. Good operational culture, which Goldfuss has championed throughout her career, includes equitable on-call distribution, blameless postmortems, investment in automation and tooling, and organizational structures that recognize and reward operational excellence alongside feature development.
What is the connection between SRE and DevOps?
SRE and DevOps share common goals — breaking down silos between development and operations, automating manual processes, and improving software delivery reliability. DevOps, coined by Patrick Debois, is primarily a cultural and organizational movement. SRE, originated at Google, is a more prescriptive implementation that defines specific practices like error budgets, SLOs, and toil tracking. Goldfuss's work sits at the intersection, applying SRE rigor while emphasizing the cultural foundations that DevOps highlights.
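The error budget mentioned above reduces to simple arithmetic: an availability SLO implies a fixed allowance of unavailability per measurement window. The following is a minimal sketch, assuming a time-based availability SLO; the function names and the 30-day default window are illustrative choices rather than part of any particular SRE toolkit.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))    # 43.2
    # After 21.6 minutes of downtime, half the budget remains.
    print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

When the remaining fraction approaches zero, the SRE model calls for shifting effort from feature work toward reliability work until the budget recovers.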
How has Alice Goldfuss influenced diversity in infrastructure engineering?
Goldfuss has been a visible advocate for diversity and inclusion in infrastructure engineering through her public speaking, mentorship of underrepresented engineers, and willingness to openly discuss the challenges of being a woman in a male-dominated subfield. Her presence at major conferences and in the broader SRE community has helped expand the perception of who belongs in infrastructure roles, and her practical advocacy for inclusive hiring practices and equitable on-call policies has influenced organizational change across the industry.