
    SRE Interview Help: Top Questions on Reliability Engineering

    Real SRE interview questions covering SLOs, error budgets, incident management, capacity planning, and toil reduction — with answer guidance from engineers who have lived through production outages.

    March 10, 2026
    11 min read
    Craqly Team

    SRE Interviews Test How You Think About Systems Under Pressure

    SRE interviews are unique because they sit at the intersection of software engineering and operations. You need to write code, but you also need to explain how you'd keep a system running at 99.99% availability while the infrastructure is on fire. It's a blend of CS fundamentals, systems thinking, and hard-won operational wisdom.

    Most SRE interviewers have been paged at 3 AM and dealt with cascading failures. They want to know you've been there too — or at least that you understand what it takes. Here are the questions you'll face and how to approach them.

    SLOs, SLIs, and SLAs

    1. What's the difference between SLOs, SLIs, and SLAs?

    How to answer: SLI (Service Level Indicator) is the metric you measure — request latency at the 99th percentile, error rate, availability percentage. SLO (Service Level Objective) is the target you set for that metric — "99.9% of requests complete in under 200ms." SLA (Service Level Agreement) is the business contract with consequences — "If we drop below 99.9% availability, we refund 10% of the monthly bill." The hierarchy: SLIs inform SLOs, and SLOs inform SLAs. Most companies have SLOs even if they don't have SLAs. Make sure your SLIs actually reflect user experience, not just server health.
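
    To make the distinction concrete, here is a toy calculation (the counts are invented — in practice they come from your metrics system) of an availability SLI checked against an SLO:

        # Toy example: an availability SLI is just good events over total events,
        # and the SLO is the target you compare it against.
        total_requests = 1_000_000
        successful_requests = 999_200   # non-5xx responses within the latency target

        sli = successful_requests / total_requests   # the indicator you measure
        slo = 0.999                                  # the objective you commit to

        print(f"SLI: {sli:.4%}  SLO: {slo:.1%}  met: {sli >= slo}")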

    2. How do you choose the right SLOs for a service?

    How to answer: Start with what users care about. For an API, that's availability and latency. For a data pipeline, that's freshness and correctness. Set SLOs based on user expectations, not engineering ego — 99.99% sounds great but means only 52 minutes of downtime per year. Is that actually necessary? Over-ambitious SLOs slow down development because the error budget is too tight. Start with a realistic target based on historical data, then tighten as the system matures. The right SLO is the one that balances reliability with velocity.

    3. Explain error budgets. How do they influence engineering decisions?

    How to answer: Error budget is the acceptable amount of unreliability — if your SLO is 99.9%, your error budget is 0.1% (about 43 minutes/month). When the budget is healthy, you ship features, run experiments, and take risks. When it's depleted, you freeze features and focus on reliability work. This creates an objective framework for the "ship features vs. improve reliability" debate. Instead of arguing opinions, you look at the error budget. It aligns product and engineering teams around a shared metric. Mention that error budgets should trigger automated policies — not just reports that get ignored.
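
    A back-of-the-envelope sketch of how the budget math works (the downtime figure is made up — in reality it comes from your SLI measurements):

        # Turn an SLO into an error budget and check how much of it has been consumed.
        slo = 0.999                    # 99.9% availability target
        window_minutes = 30 * 24 * 60  # 30-day rolling window

        error_budget_minutes = (1 - slo) * window_minutes   # ~43.2 minutes
        downtime_minutes = 25                                # outage time so far this window

        consumed = downtime_minutes / error_budget_minutes
        print(f"Budget: {error_budget_minutes:.1f} min, consumed: {consumed:.0%}")

        if consumed >= 1.0:
            print("Budget exhausted: freeze risky launches, prioritize reliability work.")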

    Incident Management

    4. Walk me through how you'd handle a production incident from detection to resolution.

    How to answer: Detection (monitoring alert fires, or user reports). Triage (severity assessment — is this S1 all-users-impacted or S3 edge-case?). Communication (page the right people, open an incident channel, update status page). Mitigation (restore service first — rollback, failover, feature flag, scale up). Root cause investigation (only after service is restored — don't debug while the house is burning). Resolution and remediation. Postmortem (blameless, focused on systemic fixes). Emphasize "mitigate first, investigate later." Too many engineers try to find the root cause while users are suffering.

    5. What makes a good postmortem? What makes a bad one?

    How to answer: A good postmortem is blameless, detailed, and actionable. It includes a timeline, impact assessment, root cause analysis, contributing factors, and action items with owners and deadlines. A bad postmortem blames individuals, lists vague action items like "be more careful" or "add monitoring" (monitoring of what?), or never actually results in changes. The best postmortems make the system safer. The worst ones make engineers afraid to make changes. Google's postmortem culture is worth referencing — they publish them internally and treat incidents as learning opportunities, not failures.

    6. How do you set up an effective on-call rotation?

    How to answer: Sustainable rotation size (at least 5-6 people so nobody is on-call more than once every 5-6 weeks). Clear escalation policies. Well-documented runbooks for common alerts. Handoff processes between rotations. Compensation (time off, extra pay). Most importantly: if the on-call engineer is getting paged 10 times a night, the system needs fixing — that's not sustainable and it's a retention killer. Track on-call burden metrics. If alerts are noisy, dedicate sprint time to fixing them. On-call should not be punishment.

    Monitoring and Observability

    7. What's the difference between monitoring and observability?

    How to answer: Monitoring tells you when something is wrong (known-unknowns — dashboards, alerts, health checks). Observability lets you understand why something is wrong, even for problems you've never seen before (unknown-unknowns). Observability is built on three pillars: metrics (quantitative — CPU, latency, error rate), logs (qualitative — what happened), and traces (contextual — the path of a single request through the system). A well-monitored system fires an alert when latency spikes. An observable system lets you trace that spike to a specific database query hitting a missing index on a specific table.

    8. How do you design effective alerts? What's alert fatigue and how do you prevent it?

    How to answer: Good alerts are actionable, well-documented, and map to SLOs. If the alert fires and the response is "do nothing" or "acknowledge and ignore," delete it. Alert fatigue happens when engineers get so many non-actionable alerts that they start ignoring all of them — including the real ones. Prevention: ruthlessly prune noisy alerts, use severity levels (page for critical, email for informational), create runbooks for every page-worthy alert, and review alert quality in postmortems. My rule: if an alert doesn't need human action within 30 minutes, it shouldn't page anyone.
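
    One way to make "alerts map to SLOs" concrete is burn-rate alerting: page only when the error budget is being consumed fast enough that a human needs to act now. A rough sketch — the thresholds and error rate below are illustrative, not a prescription:

        # Page on fast error-budget burn, file a ticket on slow burn.
        # observed_error_rate would come from your metrics backend.
        slo = 0.999
        allowed_error_rate = 1 - slo       # 0.1% of requests may fail "for free"

        def burn_rate(observed_error_rate: float) -> float:
            """How many times faster than budgeted the error budget is burning."""
            return observed_error_rate / allowed_error_rate

        observed = 0.02                    # 2% of requests failing over the last hour

        if burn_rate(observed) > 14.4:     # commonly cited fast-burn threshold
            print("PAGE: at this rate the monthly budget is gone in about two days")
        elif burn_rate(observed) > 3:
            print("TICKET: slow burn, handle during business hours")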

    9. Describe the Four Golden Signals. Why are they important?

    How to answer: Latency (time to serve a request — separate successful from failed requests), Traffic (demand on the system — requests per second, transactions per minute), Errors (rate of failed requests — explicit errors like 500s and implicit errors like wrong results), Saturation (how full the system is — CPU, memory, disk I/O, connection pools). If you're only monitoring four things, monitor these. They cover the most common failure modes and give you a comprehensive view of service health. From Google's SRE book, but universally applicable.

    Capacity Planning and Scaling

    10. How do you approach capacity planning?

    How to answer: Start with current usage data — traffic patterns, resource utilization, growth trends. Model future demand based on business projections (new features, marketing campaigns, seasonal peaks). Add headroom — typically 30-50% above peak expected load. Run load tests to validate your models. The common mistake is planning for average load instead of peak load. Your system needs to handle Black Friday, not an average Tuesday. Also plan for organic growth — if you're adding 10K users per month, your database needs to handle that in 12 months, not just today.
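
    A back-of-the-envelope projection makes this concrete. Every number below is a placeholder — the point is that you size against projected peak plus headroom, not today's average:

        # Project peak load 12 months out and size capacity with headroom.
        current_peak_rps = 4_000        # peak requests/sec today, not the average
        monthly_growth = 0.05           # 5% month-over-month growth
        months_ahead = 12
        headroom = 0.40                 # 40% buffer above the projected peak

        projected_peak = current_peak_rps * (1 + monthly_growth) ** months_ahead
        target_capacity = projected_peak * (1 + headroom)

        per_instance_rps = 500          # throughput per instance, measured via load testing
        instances = -(-target_capacity // per_instance_rps)   # ceiling division

        print(f"Projected peak: {projected_peak:,.0f} rps, "
              f"target: {target_capacity:,.0f} rps, instances: {instances:.0f}")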

    11. What's the difference between horizontal and vertical scaling? When do you use each?

    How to answer: Vertical scaling means bigger machines (more CPU, RAM). Horizontal scaling means more machines. Vertical is simpler (no distributed systems complexity) but has a ceiling — single machines only get so big, and the largest ones are disproportionately expensive. Horizontal is more complex (need load balancing, data sharding, stateless services) but scales almost indefinitely. In practice: vertically scale your database as long as possible (it's simpler), horizontally scale your application servers from day one (they're usually stateless). Horizontal scaling requires thinking about data consistency, session management, and distributed coordination.

    Toil Reduction and Automation

    12. What's toil? How do you identify and reduce it?

    How to answer: Toil is manual, repetitive, automatable work that scales linearly with service growth and has no lasting value. Examples: manually rotating certificates, provisioning accounts, restarting stuck services, running manual data migrations. Google's SRE practice says engineers should spend no more than 50% of their time on toil — the rest should be engineering work that permanently improves the system. To reduce it: track time spent on toil, prioritize automation by frequency and time cost, and make it a first-class engineering objective, not a side project.

    13. Describe an automation project you've worked on. What was the impact?

    How to answer: This is your chance to show real engineering impact. A good answer covers: what the manual process was, why it was painful (time, error-prone, scaling issues), what you built, and the measurable outcome. "We were manually provisioning development environments — took 4 hours each and we did it 3 times a week. I built a Terraform module with a CLI wrapper that reduced it to a 10-minute self-service process. Saved about 10 engineer-hours per week." Numbers matter. Show the before and after.

    System Design for Reliability

    14. How do you design a system for high availability?

    How to answer: Redundancy at every layer — multiple application servers behind a load balancer, database replicas with automatic failover, multi-AZ or multi-region deployment. Eliminate single points of failure. Circuit breakers to prevent cascading failures. Graceful degradation (serve cached data when the database is slow rather than returning errors). Health checks and automatic remediation. The key insight: high availability is a design principle, not something you bolt on later. It needs to be in the architecture from the start, and it always comes with tradeoffs — usually cost and complexity.
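
    Graceful degradation is easier to discuss with a sketch. Here's a hedged example — the database call and cache are stand-ins for real dependencies — of serving stale cached data instead of an error:

        # Prefer fresh data, but serve a stale cached copy instead of an error
        # when the primary store misbehaves. Placeholders throughout.
        import time

        cache = {}   # key -> (timestamp, value)

        def get_from_database(key):
            raise TimeoutError("primary store is slow")   # simulate an outage

        def get_profile(key):
            try:
                value = get_from_database(key)
                cache[key] = (time.time(), value)          # refresh cache on success
                return value
            except Exception:
                if key in cache:
                    _, stale = cache[key]
                    return {**stale, "stale": True}        # degraded but usable
                raise                                      # nothing to fall back to

        cache["user:42"] = (time.time(), {"name": "Ada"})  # pretend an earlier read cached this
        print(get_profile("user:42"))                      # {'name': 'Ada', 'stale': True}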

    15. What's a circuit breaker pattern? When would you use it?

    How to answer: A circuit breaker monitors calls to an external service. If failures exceed a threshold, the circuit "opens" and immediately returns an error (or fallback) instead of trying the call — preventing cascading failures and giving the downstream service time to recover. After a timeout, it enters "half-open" state and tries a few requests. If they succeed, the circuit closes. Use it for any external dependency — APIs, databases, third-party services. Without circuit breakers, one failing service can take down your entire platform because all threads are waiting on timeouts.
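
    A minimal, single-threaded sketch of the pattern (a real implementation would add locking, metrics, and a proper half-open probe count):

        import time

        class CircuitBreaker:
            def __init__(self, failure_threshold=5, reset_timeout=30.0):
                self.failure_threshold = failure_threshold
                self.reset_timeout = reset_timeout
                self.failures = 0
                self.opened_at = None              # None means the circuit is closed

            def call(self, fn, *args, **kwargs):
                if self.opened_at is not None:
                    if time.time() - self.opened_at < self.reset_timeout:
                        raise RuntimeError("circuit open: failing fast")
                    # timeout elapsed: half-open, let one trial call through
                try:
                    result = fn(*args, **kwargs)
                except Exception:
                    self.failures += 1
                    if self.failures >= self.failure_threshold:
                        self.opened_at = time.time()   # trip the breaker
                    raise
                else:
                    self.failures = 0
                    self.opened_at = None              # success closes the circuit
                    return result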

    Coding and Tooling

    16. Write a script that monitors disk usage and alerts when it exceeds 80%.

    How to answer: This tests basic scripting ability. Write it in Python or Bash. Use df to check disk usage, parse the output, compare against threshold, send alert (email, Slack webhook, PagerDuty API). Handle edge cases: multiple mount points, NFS mounts that might be slow to respond, alerting cooldown so you don't spam. The interviewer isn't looking for production-grade code — they want to see you can write a quick operational script and think about edge cases.
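
    A rough sketch of what that answer could look like in Python — the mount points are placeholders, and the alerting cooldown and real notification channel are left as talking points:

        #!/usr/bin/env python3
        # Check disk usage on a few mount points and alert above a threshold.
        import shutil

        THRESHOLD = 80                           # percent
        MOUNT_POINTS = ["/", "/var", "/home"]    # adjust per host

        def alert(message):
            # Stand-in for a real notification: Slack webhook, PagerDuty event, email.
            print(f"ALERT: {message}")

        def check(mount):
            usage = shutil.disk_usage(mount)
            percent_used = usage.used / usage.total * 100
            if percent_used > THRESHOLD:
                alert(f"{mount} is {percent_used:.1f}% full (threshold {THRESHOLD}%)")

        if __name__ == "__main__":
            for mount in MOUNT_POINTS:
                try:
                    check(mount)
                except FileNotFoundError:
                    alert(f"mount point {mount} does not exist")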

    17. How would you troubleshoot a Linux server that's running slowly?

    How to answer: Systematic approach: top/htop (CPU, memory overview), vmstat (memory pressure, swap), iostat (disk I/O), netstat/ss (network connections, too many TIME_WAIT?), dmesg (kernel errors, OOM kills), check disk space (df -h), check running processes (ps aux), check system logs (/var/log/syslog, journalctl). The methodical approach matters more than memorizing commands. Start with the broadest view and narrow down. Mention USE method (Utilization, Saturation, Errors) for systematic analysis.

    Culture and Philosophy

    18. How do you balance reliability work with feature development?

    How to answer: Error budgets. When reliability is good (error budget healthy), prioritize features. When reliability suffers (error budget depleted), prioritize reliability work. This removes the subjective "we need to be more reliable" vs. "we need to ship faster" debate and replaces it with data. Also: embed reliability requirements into feature development — every new feature should include monitoring, alerting, and capacity considerations. Reliability isn't a separate workstream; it's part of building software correctly.

    Preparing for SRE Interviews

    Read the Google SRE book — not cover to cover, but at least the chapters on SLOs, error budgets, monitoring, and incident management. These concepts come up in every SRE interview, regardless of the company.

    Have 3-4 detailed incident stories ready. Real incidents you've worked on, with specifics: what broke, how you detected it, how you mitigated it, what the root cause was, and what changed afterward. These stories demonstrate operational maturity better than any theoretical answer.

    Practice explaining operational concepts clearly. SRE interviews often include "explain X to a product manager" questions, and your ability to translate technical concepts into business impact is a differentiator. Try Craqly's AI interview copilot to practice explaining SRE concepts under time pressure — it's the closest thing to a real interview loop without actually being in one.

    Also, brush up on coding. Many SRE teams expect candidates to pass the same coding bar as software engineers, just with a focus on systems problems (parsing logs, analyzing metrics, automating workflows) rather than algorithms. Get started with Craqly to build a prep plan that covers both the operational and coding sides of SRE interviews.
