Site reliability engineering
Keep prod reliable without burning out the team. SREs who measure, not hope.
Alert fatigue and postmortems that change nothing are symptoms of missing measurement, not missing tools. We place SREs who define error budgets on checkout and API latency, trim noisy Prometheus rules, run game days on failover paths, and gate releases when burn rate spikes. On-call gets quieter because pages tie to user-visible failure, not infrastructure trivia.
Talk about your reliability gaps

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
spec:
  service: api-gateway
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total[5m]))
      alerting:
        pageAlert:
          labels:
            severity: page
          annotations:
            runbook: https://wiki/runbooks/api-availability
```

Core stack
- SLOs & error budgets
- Prometheus & Grafana
- PagerDuty / Opsgenie
- Kubernetes reliability
- OpenTelemetry
- Terraform & observability-as-code
Average years in production SRE: 5+
Engineers who've carried pagers and led incidents, not only configured dashboards.
Deep-Dive Tech Stack
Reliability spans how you define good, how you detect bad, and how you respond before customers feel it. We match on observability, incident practice, and Kubernetes operations as one operating model, not a dashboard project.
SLOs & error budgets
SLIs on user-visible paths and budget policies that gate releases in CI. Partial failures count: checkout can succeed while search degrades, and that still burns budget stakeholders agreed to watch.
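As a rough illustration of a budget policy gating a release, the sketch below computes how much of an error budget is left in an SLO window and blocks a deploy when too little remains. The function names, the 99.9% objective in the usage note, and the 10% threshold are assumptions made for illustration, not a policy taken from any specific engagement.

```python
def error_budget_remaining(objective: float, total: int, errors: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been violated.
    `objective` is a ratio such as 0.999, not a percentage.
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be between 0 and 1 (exclusive)")
    if total == 0:
        return 1.0  # no traffic in the window, so nothing was spent
    allowed_errors = (1.0 - objective) * total
    return 1.0 - errors / allowed_errors


def release_allowed(objective: float, total: int, errors: int,
                    min_remaining: float = 0.10) -> bool:
    """Hypothetical CI gate: allow a release only if enough budget is left."""
    return error_budget_remaining(objective, total, errors) >= min_remaining
```

With a 99.9% objective and one million requests in the window, 400 failed requests leave roughly 60% of the budget, so the gate passes; 950 failed requests leave roughly 5%, so it blocks.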
Prometheus & Grafana
Recording rules for SLO compliance and multi-window burn-rate alerts that page early without noise. Alerts nobody acted on in six months get deleted; new pages require runbook links before they fire.
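To make the multi-window idea concrete, here is a minimal sketch of the paging decision, written in Python rather than as a Prometheus rule. It assumes the error ratios for a long and a short window have already been measured; the function names are hypothetical. The 14.4 threshold is the commonly cited value for a 1-hour window on a 30-day SLO, meaning 2% of the budget spent in one hour.

```python
def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1.0 spends the whole budget exactly at the end of the
    SLO window. `objective` is a ratio such as 0.999.
    """
    return error_ratio / (1.0 - objective)


def should_page(long_window_ratio: float, short_window_ratio: float,
                objective: float, threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window (for example 1h) shows the burn is significant.
    The short window (for example 5m) shows it is still happening,
    which keeps the alert from firing after the problem has stopped.
    """
    return (burn_rate(long_window_ratio, objective) >= threshold
            and burn_rate(short_window_ratio, objective) >= threshold)
```

For a 99.9% objective, a 2% error ratio over the long window and 3% over the short window pages. The same long-window ratio with a short-window ratio of 0.1% does not, because the errors have already subsided.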
PagerDuty / Opsgenie
Escalation paths with timezone-aware handoffs and severity-based routing. Rotations balance load so follow-the-sun does not mean someone is always exhausted.
Kubernetes reliability
Pod disruption budgets during node drains, HPA tuned to real traffic, readiness probes that catch slow starts, and canary or blue-green rollouts. StatefulSet issues like PVC attach delays and stuck terminating pods are diagnosed with patience, not cleared with force deletes and other destructive kubectl habits.
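For the HPA part, it helps to see the scaling rule that tuning works against. The sketch below reproduces the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. The function name and the min/max defaults are assumptions for illustration; in a real cluster the bounds come from the HPA spec.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the replica count an HPA would ask for.

    The ratio of observed to target metric scales the current replica
    count. Ratios within the tolerance band around 1.0 cause no change,
    which is what stops the autoscaler from flapping on small swings.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 replicas at 90% average CPU against a 60% target, the rule asks for 6 replicas. At 63% it stays at 4, because the ratio is inside the tolerance band.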
OpenTelemetry
Trace propagation across sync and async boundaries with service maps for triage. Sampling follows service criticality so observability spend does not double when trace volume does.
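One way to express criticality-based sampling is a per-tier rate applied deterministically to the trace ID, so every service makes the same keep-or-drop decision for a given trace. This is a plain-Python sketch, not OpenTelemetry SDK code; the tier names and rates are hypothetical.

```python
# Hypothetical tiers and rates: keep every trace on critical paths,
# a quarter on standard services, and 1% on batch workloads.
SAMPLE_RATES = {"critical": 1.0, "standard": 0.25, "batch": 0.01}
DEFAULT_RATE = 0.01

_MASK_64 = 2**64 - 1


def keep_trace(trace_id: int, tier: str) -> bool:
    """Decide whether to keep a trace, based only on its ID and tier.

    The low 64 bits of the trace ID are treated as a uniform random
    value and compared against a bound proportional to the tier's rate.
    The same trace ID always yields the same decision.
    """
    rate = SAMPLE_RATES.get(tier, DEFAULT_RATE)
    bound = int(rate * 2**64)
    return (trace_id & _MASK_64) < bound
```

Because the decision depends only on the trace ID and the tier, lowering the rate for a low-criticality tier reduces stored trace volume without breaking traces that cross services in the same tier.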
Terraform & observability-as-code
Alert modules and dashboard JSON in Git with staging that mirrors prod for load tests. Monitoring drift is caught in PR review, not when prod stops paging after a console edit.
Chaos engineering
Game days inject latency, pod kills, and AZ failures before customers find the edge case. Hypothesis-driven experiments produce tracked remediations, not just resilience slide decks.
Loki & log aggregation
Label-based log queries correlated with traces and metrics, retention tiers by service criticality, and structured JSON from applications. Incident triage starts from one query across pods instead of SSH and grep across a dozen nodes.
Runbooks & postmortems
Executable runbooks linked from alerts, blameless postmortems with tracked action items, and error budget reviews that change release policy when burn rate spikes. Reliability improvements ship as code and process, not as slides filed after the outage.
Reliability outcomes we optimize for
- Average years in production SRE: 5+. Engineers who've carried pagers and led incidents, not only configured dashboards.
- Typical alert noise reduction: 40%+. After SLO-based alerting and runbook hardening on engagements we've supported.
- Availability targets we've helped define: 99.9%+. With error budgets, release policies, and measurable SLIs, not vanity uptime charts.
- From access to first SLO draft: days. For scoped services with existing metrics, not months of discovery before anything ships.
SRE staffing questions we hear often
How do you handle time-zone crossovers for on-call?
On-call overlap is explicit in the contract. We align escalation windows with your core team and document handoffs so incidents spanning zones don't lose context between shifts.
Do SREs replace our DevOps or platform team?
No. They complement platform engineering with reliability practices: SLOs, incident response, capacity planning, and postmortem follow-through. Your platform team stays the owner of infra delivery.
What is your approach to alert design?
Alerts tie to user-visible symptoms and SLO burn rates, not CPU graphs nobody acts on. Every new alert ships with a runbook link and an owner before it pages anyone.
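A rule like this can be enforced mechanically in review. The sketch below is one possible check, assuming alert rules have already been parsed from YAML into Python dictionaries in the usual Prometheus rule shape. The function name, the `runbook` annotation key, and the `owner` label key are assumptions for illustration.

```python
def missing_alert_metadata(rules: list[dict]) -> dict[str, list[str]]:
    """Return, per alert name, the required fields it is missing.

    Recording rules (entries without an "alert" key) are skipped.
    An empty result means every alert has a runbook link and an owner.
    """
    problems: dict[str, list[str]] = {}
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rule, nothing to page on
        missing = []
        if not rule.get("annotations", {}).get("runbook"):
            missing.append("annotations.runbook")
        if not rule.get("labels", {}).get("owner"):
            missing.append("labels.owner")
        if missing:
            problems[rule["alert"]] = missing
    return problems
```

A CI job could fail the pull request whenever the returned dictionary is non-empty, so an alert without a runbook or owner never reaches the paging path.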
Can you work inside our existing observability stack?
Yes. We work in Datadog, New Relic, Prometheus, or cloud-native tooling, and we don't require a migration to staff SREs.
How do you handle post-incident follow-through?
Postmortems produce tracked action items with owners and dates. We don't stop at a blameless doc in a folder nobody opens again.