When your web application serves thousands of users, understanding what happens behind the scenes becomes critical. Monitoring and observability are not interchangeable terms — monitoring tells you when something breaks, while observability helps you understand why. Together, they form the foundation of reliable, performant web applications.
The modern observability stack has matured around open-source tools like Prometheus for metrics collection, Grafana for visualization, and a growing ecosystem of complementary tools for logs, traces, and alerts. This guide walks you through building a production-grade monitoring setup for web applications, from instrumenting your code to creating actionable dashboards.
Understanding the Three Pillars of Observability
Before diving into specific tools, it is important to understand what observability actually means in the context of web applications. The concept borrows from control theory: a system is observable if you can determine its internal state from its external outputs. For web applications, this translates into three pillars.
Metrics
Metrics are numerical measurements collected over time. They answer questions like “How many requests per second is my API handling?” or “What is the 95th percentile response time?” Metrics are lightweight, easy to aggregate, and excellent for alerting. Prometheus has become the de facto standard for metrics collection in cloud-native environments.
Logs
Logs are structured or unstructured text records of discrete events. They provide detailed context about what happened at a specific moment — an error stack trace, an authentication failure, a database query timeout. Tools like Loki (by Grafana Labs), Elasticsearch, and Fluentd handle log aggregation at scale.
Traces
Distributed traces follow a single request as it travels through multiple services. In a microservices architecture, a single user action might touch ten different services. Traces connect these interactions, showing you exactly where latency occurs. Jaeger and Tempo are popular open-source tracing solutions.
Prometheus: The Metrics Engine
Prometheus, originally developed at SoundCloud and now a graduated Cloud Native Computing Foundation project, uses a pull-based model for metrics collection. Rather than applications pushing metrics to a central server, Prometheus periodically scrapes HTTP endpoints exposed by your services. This approach simplifies configuration and makes it easy to monitor dynamic environments like Kubernetes clusters.
Core Concepts
Prometheus organizes data around a multi-dimensional data model. Each metric has a name and a set of key-value pairs called labels. For example, http_requests_total{method="GET", status="200", handler="/api/users"} tracks successful GET requests to the users endpoint. This labeling system enables powerful querying without predefined aggregation.
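To make the label model concrete, here is a toy JavaScript sketch (an illustration only, not the prom-client API): each distinct combination of label values identifies its own time series.

```javascript
// Illustrative only: a toy labeled counter, not the prom-client API.
// Each distinct label combination identifies a separate time series.
class LabeledCounter {
  constructor(name) {
    this.name = name;
    this.series = new Map(); // serialized label set -> current value
  }
  key(labels) {
    // Sort keys so {a, b} and {b, a} map to the same series.
    return Object.keys(labels).sort()
      .map((k) => `${k}="${labels[k]}"`).join(',');
  }
  inc(labels, value = 1) {
    const k = this.key(labels);
    this.series.set(k, (this.series.get(k) || 0) + value);
  }
  get(labels) {
    return this.series.get(this.key(labels)) || 0;
  }
}

const requests = new LabeledCounter('http_requests_total');
requests.inc({ method: 'GET', status: '200', handler: '/api/users' });
requests.inc({ method: 'GET', status: '200', handler: '/api/users' });
requests.inc({ method: 'GET', status: '500', handler: '/api/users' });
```

Queries can then aggregate freely across any label, which is exactly what PromQL does server-side.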
There are four metric types in Prometheus:
- Counter — a cumulative value that only goes up (e.g., total requests served). Resets to zero on restart.
- Gauge — a value that can go up or down (e.g., current memory usage, active connections).
- Histogram — samples observations and counts them in configurable buckets (e.g., request duration distribution).
- Summary — similar to a histogram, but calculates configurable quantiles on the client over a sliding time window. Unlike histogram buckets, summary quantiles cannot be meaningfully aggregated across instances.
Instrumenting a Node.js Application
Adding Prometheus metrics to a Node.js web application requires the prom-client library. The following example demonstrates how to expose key HTTP metrics from an Express application, including request counts, response times, and active connection tracking.
```javascript
const express = require('express');
const client = require('prom-client');

// Create a Registry to register metrics
const register = new client.Registry();

// Add default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register, prefix: 'webapp_' });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'webapp_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
register.registerMetric(httpRequestDuration);

const httpRequestsTotal = new client.Counter({
  name: 'webapp_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestsTotal);

const activeConnections = new client.Gauge({
  name: 'webapp_active_connections',
  help: 'Number of active connections'
});
register.registerMetric(activeConnections);

const app = express();

// Middleware to track metrics for every request
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is set once Express has matched a route
    const route = req.route ? req.route.path : req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode
    };
    end(labels);
    httpRequestsTotal.inc(labels);
    activeConnections.dec();
  });
  next();
});

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/api/users', (req, res) => {
  res.json({ users: [] });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
  console.log('Metrics available at /metrics');
});
```
This instrumentation exposes a `/metrics` endpoint that Prometheus scrapes at regular intervals. The default metrics include Node.js runtime information like heap size, event loop lag, and garbage collection statistics — all valuable for diagnosing performance issues. You can verify the metrics endpoint works by running `curl http://localhost:3000/metrics` and reviewing the output.
PromQL: Querying Your Metrics
PromQL (Prometheus Query Language) is what makes Prometheus truly powerful. Here are essential queries for web application monitoring:
- Request rate: `rate(webapp_http_requests_total[5m])` — average requests per second over 5 minutes.
- Error rate: `rate(webapp_http_requests_total{status_code=~"5.."}[5m]) / rate(webapp_http_requests_total[5m])` — the fraction of requests returning a 5xx status (multiply by 100 for a percentage).
- 95th percentile latency: `histogram_quantile(0.95, rate(webapp_http_request_duration_seconds_bucket[5m]))` — the response time that 95% of requests fall under.
- Memory usage: `webapp_process_resident_memory_bytes / 1024 / 1024` — resident memory in megabytes.
These queries can be used directly in Grafana dashboards or as the basis for alerting rules. The ability to slice and dice metrics by labels makes PromQL incredibly flexible for debugging production issues.
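When the same queries back several dashboards and alerts, Prometheus recording rules let you precompute them on a schedule. A minimal sketch, with hypothetical rule names, assuming the file is referenced under `rule_files` in `prometheus.yml`:

```yaml
# rules/webapp.yml — hypothetical recording rules for the queries above
groups:
  - name: webapp-recording
    interval: 30s
    rules:
      - record: webapp:http_requests:rate5m
        expr: sum(rate(webapp_http_requests_total[5m])) by (route)
      - record: webapp:http_errors:ratio5m
        expr: >
          sum(rate(webapp_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(webapp_http_requests_total[5m]))
      - record: webapp:http_request_duration:p95_5m
        expr: >
          histogram_quantile(0.95,
            sum(rate(webapp_http_request_duration_seconds_bucket[5m])) by (le))
```

Dashboards and alerts can then reference the precomputed series, which keeps heavy queries off the hot path.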
Grafana: Visualizing Your Data
Grafana transforms raw metrics into meaningful, actionable dashboards. It supports dozens of data sources — Prometheus, Loki, Elasticsearch, PostgreSQL, InfluxDB, and more — making it the universal visualization layer for observability stacks. Grafana dashboards are defined as JSON, which means they can be version-controlled alongside your application code in your CI/CD pipeline.
Building a Web Application Dashboard
A well-designed dashboard answers critical questions at a glance. For web applications, the RED method (Rate, Errors, Duration) provides an effective framework. The following Grafana dashboard JSON defines panels for these core metrics, ready to import into any Grafana instance connected to Prometheus.
```json
{
  "dashboard": {
    "title": "Web Application Overview",
    "tags": ["webapp", "prometheus"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(webapp_http_requests_total[5m])) by (route)",
            "legendFormat": "{{route}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "custom": {
              "drawStyle": "line",
              "fillOpacity": 15,
              "lineWidth": 2
            }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(webapp_http_requests_total{status_code=~'5..'}[5m])) / sum(rate(webapp_http_requests_total[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "Response Time (p95)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 0 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(webapp_http_request_duration_seconds_bucket[5m])) by (le, route))",
            "legendFormat": "p95 {{route}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "custom": {
              "drawStyle": "line",
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "title": "Active Connections",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "webapp_active_connections",
            "legendFormat": "Connections"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 500, "color": "yellow" },
                { "value": 1000, "color": "red" }
              ]
            },
            "min": 0,
            "max": 2000
          }
        }
      },
      {
        "title": "Memory Usage (MB)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 18, "x": 6, "y": 8 },
        "targets": [
          {
            "expr": "webapp_process_resident_memory_bytes / 1024 / 1024",
            "legendFormat": "Resident Memory"
          },
          {
            "expr": "webapp_nodejs_heap_size_used_bytes / 1024 / 1024",
            "legendFormat": "Heap Used"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "decmbytes",
            "custom": {
              "fillOpacity": 20,
              "lineWidth": 2
            }
          }
        }
      }
    ],
    "time": { "from": "now-6h", "to": "now" },
    "refresh": "30s"
  }
}
```
Import this JSON through Grafana’s dashboard import feature (the “+” icon, then “Import dashboard”). Note that the outer "dashboard" wrapper is the envelope used by Grafana’s HTTP API; when pasting into the UI importer, use the inner dashboard object. Each panel targets specific Prometheus queries and includes threshold-based coloring so problems stand out immediately. The error rate stat panel turns yellow above 1% and red above 5%, giving your team instant visual feedback.
Deploying the Monitoring Stack with Docker Compose
One of the fastest ways to get a complete monitoring stack running is through Docker Compose. A typical setup includes Prometheus for metrics, Grafana for dashboards, Alertmanager for notifications, and optionally Node Exporter for host-level metrics. Each component runs as a separate container, connected through a shared Docker network.
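A minimal Compose file might look like the following sketch. The images are the official ones, but the tags, ports, and volume paths are assumptions to adapt for your environment (and in production you should pin image versions):

```yaml
# docker-compose.yml — a minimal sketch, not a production configuration
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"   # avoids clashing with the app on port 3000
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
```

Containers on the same Compose network can reach each other by service name, so Grafana’s Prometheus data source URL would be `http://prometheus:9090`.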
The Prometheus configuration file (prometheus.yml) defines which targets to scrape and at what interval. For web applications, you typically scrape your application’s /metrics endpoint every 15 seconds. In Kubernetes environments, Prometheus can automatically discover services through service discovery mechanisms, eliminating manual configuration as your services scale.
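A corresponding minimal `prometheus.yml` sketch; the job name and target address are assumptions (`host.docker.internal` resolves to the host on Docker Desktop, but not on every Linux setup):

```yaml
# prometheus.yml — minimal sketch for scraping the example app
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "webapp"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:3000"]
```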
Grafana stores its configuration in a SQLite database by default, but for production deployments you should use PostgreSQL or MySQL for durability. Dashboard provisioning through YAML files enables you to define dashboards as code, deploying them automatically when Grafana starts. This approach pairs well with Infrastructure as Code practices where your entire monitoring setup is version-controlled and reproducible.
Alerting: Turning Data Into Action
Dashboards are valuable, but you cannot stare at them 24/7. Effective alerting bridges the gap between data collection and incident response. Prometheus Alertmanager handles alert routing, grouping, silencing, and notification delivery to channels like Slack, PagerDuty, email, and webhooks.
Designing Alerts That Matter
Alert fatigue is a real problem. Teams drowning in noisy alerts eventually start ignoring them, defeating the purpose entirely. Follow these principles for effective alerting:
- Alert on symptoms, not causes. Users care about error rates and latency, not CPU usage. High CPU might be perfectly fine if response times remain acceptable.
- Use appropriate thresholds. A momentary spike is different from a sustained degradation. Use `for` clauses in Prometheus alerting rules to require a condition to persist before firing.
- Include runbook links. Every alert should link to documentation describing what the alert means and how to respond. This reduces mean time to resolution (MTTR), especially for on-call engineers encountering an issue for the first time.
- Tier your alerts. Critical alerts (service down, data loss risk) page someone immediately. Warning alerts (elevated error rates, approaching capacity) go to a Slack channel. Informational alerts feed into daily reports.
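These principles translate directly into Prometheus alerting rules. A hedged sketch, with a hypothetical runbook URL and a `severity` label for tiered routing in Alertmanager:

```yaml
# alerts/webapp.yml — hypothetical alerting rules illustrating the principles above
groups:
  - name: webapp-alerts
    rules:
      - alert: HighErrorRate
        expr: >
          sum(rate(webapp_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(webapp_http_requests_total[5m])) > 0.05
        for: 10m                 # must persist for 10 minutes before firing
        labels:
          severity: critical     # routes to paging in Alertmanager
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```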
The combination of well-designed alerts and collaborative project management tools like Taskee helps engineering teams track incident follow-ups and post-mortems effectively, turning monitoring data into continuous improvement.
Log Aggregation with Grafana Loki
While Prometheus handles metrics brilliantly, logs require a different approach. Grafana Loki, inspired by Prometheus, applies the same label-based philosophy to log aggregation. Unlike Elasticsearch, which indexes the full text of every log line, Loki only indexes metadata (labels), making it significantly cheaper to operate at scale.
Loki integrates natively with Grafana, letting you correlate metrics and logs in the same interface. When you spot a latency spike on a Grafana dashboard, you can click through to see the corresponding application logs — no context switching between tools. The query language, LogQL, feels familiar to PromQL users.
For web applications, structure your logs with consistent fields: timestamp, log level, request ID, user ID, service name, and the actual message. Structured logging (JSON format) makes it dramatically easier to filter and search through millions of log entries when debugging production issues.
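A minimal sketch of such a structured logger in plain Node.js. The field names follow the convention suggested above, and the service name is an assumption:

```javascript
// Minimal structured-logging sketch: one JSON object per log line.
function logEvent(level, message, context = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: 'webapp',                      // assumed service name
    requestId: context.requestId || null,   // correlate lines across a request
    userId: context.userId || null,
    message,
  };
  return JSON.stringify(entry); // in production, write this line to stdout
}

const line = logEvent('error', 'database query timeout', {
  requestId: 'req-123',
  userId: 'u-42',
});
```

Because every line is parseable JSON with stable field names, a LogQL or Elasticsearch query can filter on `requestId` or `level` instead of grepping free text.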
Distributed Tracing for Complex Systems
As web applications evolve from monoliths to distributed systems, understanding request flow becomes increasingly difficult. A single page load might involve an API gateway, authentication service, user service, product service, and cache layer. Distributed tracing tools like Jaeger and Grafana Tempo provide visibility into these multi-service interactions.
OpenTelemetry has emerged as the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs for generating traces, metrics, and logs. By instrumenting your application with OpenTelemetry, you can send data to any compatible backend — Jaeger, Tempo, Datadog, or New Relic — without changing your application code. This flexibility aligns well with DevOps culture where teams choose tools that fit their workflow rather than being locked into a single vendor.
Service Level Objectives and Error Budgets
Monitoring data becomes truly powerful when framed around Service Level Objectives (SLOs). An SLO defines the target reliability for a service — for example, “99.9% of API requests return a successful response within 500ms.” This creates a shared vocabulary between engineering and business stakeholders.
The error budget concept follows naturally: if your SLO is 99.9% availability, you have a monthly error budget of approximately 43 minutes of downtime. When the error budget is healthy, teams can ship features faster and take calculated risks. When the error budget is depleted, the focus shifts to reliability work. Grafana supports SLO tracking through dedicated panels that show error budget burn rate over time.
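The 43-minute figure is simple arithmetic over a 30-day month, which is easy to sanity-check:

```javascript
// Error budget (allowed downtime) for an availability SLO.
function errorBudgetMinutes(slo, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in a 30-day month
  return (1 - slo) * totalMinutes;
}

const budget = errorBudgetMinutes(0.999); // ≈ 43.2 minutes per month
```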
This approach to web performance puts concrete numbers behind reliability discussions, replacing subjective opinions with data-driven decisions about where to invest engineering effort.
Monitoring in CI/CD Pipelines
Observability should not start only after deployment. Modern CI/CD pipelines integrate monitoring checks at every stage. Performance tests during the build phase can catch regressions before they reach production. Canary deployments use real-time metrics to automatically roll back releases that degrade service quality.
Prometheus metrics can feed into deployment automation: if the error rate exceeds a threshold within the first ten minutes of a rollout, the deployment system automatically reverts to the previous version. This closed-loop approach to deployment reduces the blast radius of bad releases and gives engineering teams confidence to ship more frequently.
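The rollback decision itself reduces to a small predicate. In practice the error rate would come from a PromQL query; the threshold and observation window here are assumptions:

```javascript
// Sketch of an automated canary rollback check: revert only when the
// error rate breaches the threshold during the post-rollout window.
function shouldRollback({ errorRate, minutesSinceRollout }, threshold = 0.05, windowMinutes = 10) {
  return minutesSinceRollout <= windowMinutes && errorRate > threshold;
}
```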
For teams managing complex web projects across multiple environments, having a reliable project management platform like Toimi alongside a robust monitoring stack ensures that observability improvements are tracked, prioritized, and delivered systematically rather than implemented ad hoc during incidents.
Best Practices for Production Monitoring
After implementing monitoring for dozens of web applications, several patterns consistently prove valuable:
- Monitor the four golden signals: latency, traffic, errors, and saturation. These four measurements, proposed by Google’s SRE team, cover the essential health indicators for any service.
- Use consistent naming conventions. Metric names should follow a pattern like `{application}_{subsystem}_{metric}_{unit}`. Consistency across services makes it easier to build unified dashboards and reusable alerting rules.
- Retain data appropriately. High-resolution data (15-second intervals) for the last 24 hours, downsampled data for weeks, and aggregated data for months. Tools like Thanos and Cortex extend Prometheus with long-term storage and global querying.
- Implement health check endpoints. Beyond the `/metrics` endpoint, expose `/healthz` (liveness) and `/readyz` (readiness) endpoints. Kubernetes uses these for automated container management, restarting unhealthy instances and routing traffic away from pods that are not ready.
- Dashboard hierarchy. Start with a high-level overview dashboard showing all services, then provide drill-down dashboards for each service with detailed metrics. This top-down approach helps on-call engineers quickly narrow down the source of problems.
- Correlate across pillars. The real power of observability emerges when you can jump from a metric spike to related logs to a distributed trace in seconds. Grafana’s Explore view enables this cross-pillar investigation workflow.
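The liveness and readiness checks from the list above can be modeled as framework-agnostic handler functions before wiring them into Express routes or Kubernetes probes. The state fields here are illustrative assumptions:

```javascript
// Sketch of liveness vs readiness semantics; state fields are hypothetical.
const state = { dbConnected: true, warmupDone: false };

function healthz() {
  // Liveness: the process is up and able to respond at all.
  return { status: 200, body: 'ok' };
}

function readyz() {
  // Readiness: dependencies are reachable and warmup is complete.
  const ready = state.dbConnected && state.warmupDone;
  return ready
    ? { status: 200, body: 'ready' }
    : { status: 503, body: 'not ready' };
}
```

The distinction matters operationally: a failing liveness probe restarts the container, while a failing readiness probe only removes the pod from load-balancer rotation.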
Choosing the Right Stack
The open-source monitoring ecosystem offers many options. Your choice depends on scale, budget, and team expertise:
- Small teams / single application: Prometheus + Grafana + Loki. Simple to deploy, covers metrics and logs, runs comfortably on modest hardware.
- Medium scale / microservices: Add Tempo for distributed tracing and Alertmanager for notifications. Consider Thanos for long-term metric storage.
- Large scale / multi-cluster: Evaluate Cortex or Mimir for horizontally scalable metrics, Tempo for traces at scale, and Grafana Enterprise for role-based access and reporting.
- Managed alternatives: Grafana Cloud, Datadog, New Relic, and AWS CloudWatch offer similar capabilities without operational overhead, trading cost flexibility for ease of management.
Regardless of scale, the principles remain the same: instrument your code, collect the three pillars of observability data, visualize it meaningfully, and alert only on actionable conditions. Start small with Prometheus and Grafana, then expand as your needs grow.
FAQ
What is the difference between monitoring and observability?
Monitoring is the practice of collecting predefined metrics and setting alerts based on known failure modes — it tells you when something goes wrong. Observability is broader: it is the ability to understand the internal state of your system from its external outputs (metrics, logs, traces). An observable system lets you investigate novel, unexpected problems by querying and correlating data in ways you did not anticipate when instrumenting the code. In short, monitoring answers known questions, while observability enables you to ask new ones.
How does Prometheus differ from traditional monitoring tools like Nagios or Zabbix?
Prometheus uses a pull-based model where it scrapes HTTP endpoints, unlike push-based systems. It features a multi-dimensional data model with labels that enable flexible querying through PromQL, has built-in service discovery for dynamic environments like Kubernetes, and stores time-series data efficiently. Traditional tools like Nagios focus primarily on host and service checks with binary up/down status. Prometheus excels at high-cardinality metrics in cloud-native, containerized environments where services are ephemeral and auto-scaling.
How much infrastructure does a Prometheus and Grafana stack require?
For small to medium applications (up to 100 services, 100,000 active time series), a single Prometheus instance with 4 CPU cores, 8 GB RAM, and fast SSD storage handles the workload comfortably. Grafana itself is lightweight, requiring about 1 CPU core and 512 MB RAM. The stack can run on a single server or within your existing Kubernetes cluster. For larger deployments exceeding millions of active series, horizontal scaling solutions like Thanos, Cortex, or Grafana Mimir distribute the load across multiple nodes.
What metrics should I monitor first for a web application?
Start with the RED method: Rate (requests per second), Errors (percentage of failed requests), and Duration (response time percentiles, especially p95 and p99). These three metrics cover the user-facing health of your application. Next, add resource metrics: CPU usage, memory consumption, disk I/O, and network throughput. For database-backed applications, include connection pool utilization, query execution time, and cache hit ratios. These metrics together give you a comprehensive view of both user experience and system capacity.
Can I use Grafana without Prometheus?
Yes. Grafana is a data-source-agnostic visualization platform that supports over 50 data sources. You can connect it to Elasticsearch for log visualization, InfluxDB or TimescaleDB for time-series metrics, PostgreSQL or MySQL for business metrics, CloudWatch for AWS monitoring, and many others. Grafana can query multiple data sources simultaneously, displaying metrics from Prometheus alongside data from a SQL database in the same dashboard. However, Prometheus remains the most popular pairing due to their complementary design philosophies and native integration.