What they don't tell you about working for yourself is that you can be effectively on call 24x7, every day. I am currently supporting four wineries that are processing thousands of tonnes of receivals 24x7. It happens for two months of the year and I am expected to be available from 06:00 to 22:00 during that time. There is no phoning in sick or having a lazy day; I work alone and only have one reputation. I don't want to be that contractor forever known for destroying a client's business.
You can only do this for so long, though. When two or three problems come in simultaneously, you end up dropping something halfway through because something more important has come in. I once executed an SQL UPDATE query without a WHERE clause under this kind of pressure, and ended up working until the next morning to recover, only to start again at 6AM. I have even had landline calls at 2AM to bypass my mobile restrictions. The rewards are great, but don't let anyone tell you it is always easy.
My current system is 16 years old now and I know all the ins and outs, so it has been pretty easy to keep on top of things the last several years. However, I am glad the replacement system is nearly written and it will be somebody else's problem in 2026.
In your case, you're the only employee of your business. And if you're not there the business will literally go under. And you also get directly rewarded for being there. I would guess that being 'on call' in this manner is possibly less draining on a person's soul (depending on how well they tolerate the risk of owning a business).
Contrast that with being 'on call' for your megacorporation, who isn't giving you anything extra for your on call time because they 'already pay you enough'. And where the only negative consequence for the company if you fail to immediately respond within 15 minutes is that some executive in the company is kept waiting longer than 15 minutes, or some ads aren't being shown for 15 minutes.
But if you aren't there, your boss is going to get a phone call and that's definitely not going to look good on you. And there's no bonus for fixing the problem, that was already your job in the first place. Sucks that you had to do it outside of scheduled hours, oh well.
I'm with the author of this article. Take your on-call rotation and shove it (if you're a large corporation). I'm fortunate enough to be able to take a firm stance on this point, and do so happily.
> And you also get directly rewarded for being there. ... Contrast that with being 'on call' for your megacorporation, who isn't giving you anything extra for your on call time because they 'already pay you enough'.
I'm really not seeing the distinction here. If a company offers a salary and includes on-call as part of the deal (and communicates that up front so it's not a bait and switch), how is that different than running your own business and getting compensated for your on-call time as part of a package that you sell to a client? In both cases you agree up front that you will be part of an on-call plan. In neither case are you getting a bonus for doing a good job at on-call, because either way you're just doing your job that you committed to ahead of time!
I'm totally sympathetic to people who don't want to be part of an on-call. Jobs that have on-call aren't for you, that's fine. But I don't get this idea that it's uncompensated labor, unless there are tons of people out there who somehow ended up in jobs that sprung on-call on them without warning.
Let's say I'm a business owner and I'm frustrated with the current state of the on-call system. I have options.
I can try negotiating with my clients to lessen the load in some way. Obviously this isn't always possible but it often is. I once had a freelance project that required 24 hours of on-call after a release. I negotiated release days that were convenient for me (never Fri/Sat/Sun). One time the client pushed back, I pushed back harder, and I won. For my pushback to work I needed enough negotiating strength, which I had planned for ahead of time.
I can upgrade my systems. For example if my current paging system is insufficient I can choose to pay $10/month more for another system that makes my life easier. I can set aside time to refactor my alerts code to make my life easier and I don't need to justify it to anyone but myself.
I can straight up refuse to do on-call and deal with the consequences to my business. Freelance developers do this all of the time. We choose which client work to do and not to do. We can make these choices arbitrarily. Sometimes it's seasonal. Sometimes it's just based on vibes. Doesn't matter; it's our company.
Meanwhile the average on-call engineer at a large company has none of these freedoms. The underlying systems are chosen for them and they just have to deal with it.
This story is from an Indian company; the on-call "expectations" might be different in your country. We were responsible for a 400k RPM service. It handled ads, so it was fairly important to the business. Whenever I had to go out for a night, or go out for a family event, or whatever, I was always able to hand over on-call to a team member for that duration. Of course, this also happened during others' on-call, where I would take over. In fact, this happened daily! From ~7pm to ~9pm every day I would play football or whatever, and I would always hand over on-call to another team member during this time. I usually wake up earlier than others, so I used to respond to alerts during those hours regardless of who was on call. On nights when I was staying up to watch Champions League football matches, or for some other reason, I would take on-call as well. We just set up PagerDuty's escalation order appropriately. It probably helped that there were just 5 of us in the team: easy coordination. Of course, this was my first job, and I messed up quite a bit, but I noticed the others following a similar system without me as well.
It also depends on the nature of the alerts, I suppose. For us, the majority of the alerts could be checked and resolved from a mobile phone (they are alerts that could strictly speaking be resolved in an automated fashion, but the automation would get complex enough, with dependencies on other services' metrics, that we wouldn't be sure of not having bugs in _that_ code). About once a week or two weeks we would get an alert that needed us to look at the logs and so on.
> Meanwhile the average on-call engineer at a large company has none of these freedoms. The underlying systems are chosen for them and they just have to deal with it.
In most cases they have all of those freedoms, and the only barrier is one that's shared with the self-employed person: not liking the consequences of choosing those options.
They could negotiate with their manager to lessen the load. They could upgrade the systems. They could straight-up refuse on call.
They don't because they don't like the consequences of taking these options—and neither does the self-employed person!
> not liking the consequences of choosing those options
Correct. "Shove it" is usually preceded by not liking something.
> They could negotiate with their manager to lessen the load.
Most of the time the manager will simply refuse. As a business owner it's my decision.
> They could upgrade the systems.
At big companies this is usually outside the scope of an on-call engineer. The on-call engineer often doesn't even have commit rights to that repository.
The specific example I gave was paying $10/month more. That can be a very hard sell at a large company because their service contracts are much more complicated/expensive.
> They could straight-up refuse on call.
A business owner has much more negotiating power than an employee does.
> They don't because they don't like the consequences of taking these options—and neither does the self-employed person!
In the vast majority of cases making changes to the on-call infrastructure has very little (if any) measurable impact on the business. Like spending a week making the systems better. Or changing deploy/release dates to be more convenient.
As a business owner I can take advantage of this and make my life easier.
As an employee I have layers of bureaucracy to wade through and will probably be refused. Not because it affects the business but for other reasons.
Do others not generally get extra pay for the time on call?
I have the enviable situation where I am on call for half of every month, I get paid significantly extra for this, and there's maybe 1 emergency call per year.
The big difference is as an owner you are fully in control of allocating your time, and so if out of hours workload is becoming too much, you can choose to not work on other things in favour of fixing that. In the corporate world, there's some manager who weighs up spending two weeks to properly fix an issue or automate a process vs just making their workers unhappy and doing something that will make the manager look good in the internal politics, and often will insist on the latter.
But that's the problem with being in a bad company no matter what. If it weren't on call it would be something else that that manager would be making you suffer through. That doesn't mean on-call as a concept is terrible or that it's uncompensated labor, it just means that bad managers are capable of screwing up your life with any tools that you give them.
Bad managers are hard to predict. Even if you like your manager now, there's nothing to say they can't be promoted/reassigned/quit and you get a new manager that sucks. On-call being a thing is easier to get an answer on ahead of time.
But it's meaningless as a signal for bad managers because basically everyone does on call in some form. That's what this whole discussion is about: how ubiquitous it is.
The only job I worked that didn't have a formal on-call rotation ended up with me unofficially on call, with the same expectations as though that had been set up up front: boss calls whenever he calls and expects an answer, and I'm left deciding how badly I want this job. Turns out management there sucked and I ended up on call after all.
If you find a company that actually has a good story for why they're able to get away with no on call, that might be a good signal. But if they're out there I'd love to hear from them, because most people here are just speculating about better alternatives, not speaking from experience with ones that actually worked.
At Amazon it’s common to have terrible, week-long oncall shifts with many repetitive pages. At Google they have shifts that follow the sun and they get PTO to compensate for being oncall during the weekend. Both jobs pay similarly. And I think most people joining Amazon don’t know about the oncall they are in for.
This ultra-libertarian take is consistent but not realistic. Realistically, the group decides that some amount of work or sacrifice is not compatible with having a good life, so laws are passed that either disallow such extremes or mandate extreme compensation.
My friend worked for Amazon in India (software). He was often put on 24x7 "on call" (touted as "good" here, because how else will you "learn") during his 3rd or 4th week. By the third night he was vomiting and had to visit the doctor. His manager called and asked whether he had brought his laptop with him. His mother forced him to resign the next day (he is from an extremely rich family, though). This is common in most companies here, and among the famous MNCs in India, Amazon and Uber are especially known for it.
What shocks more is these are the companies that can "follow the sun" w/o breaking a drop of fucking sweat!
I have lost too many interviews just because I clearly asked for this, I always do, and I am doing it even now while I have been without a job after taking a year gap (which makes getting calls already difficult esp. with this AI and vibe onslaught). I am not giving up on this. I personally have never agreed to this which has caused lots of confrontations and stress(!!); a major source of my burnout WITHOUT ever doing the 24x7 on-call - so by just fighting it and keeping it away from me alone I was burnt out to the bits. It took me finally seeking medical advice to realise I was burnt out.
I hope this doesn't sound dramatic, but even now, after a year of resting and travelling, the mere mention of words like Splunk, VictorOps (same as Splunk iirc), or PagerDuty gives me a minor trigger-attack kind of sensation and makes me very agitated.
But this is so common here. So common that it is considered one of the realities, truths. Yet, I have never understood, how, how can one agree to this? How? Is it some kind of social (if not racial) slave mentality? Is it some kind of grand coercion that they have no escape from? Or maybe it's just generation after generation subjecting the next generation to what they were subjected to while the stakeholders in the richer countries (because that is the structure) demand of this implicitly as they are stopped by health and safety laws from subjecting their underlings in their own developed home nations maybe.
It's like demands from tech executives for long hours: "I worked long hours to make myself rich; why won't all of you work long hours to make me richer?"
I somehow do not think OP is working for themselves. They are a contractor there. I do believe contracting is just being an employee on slightly different terms.
> I work alone and only have one reputation. I don't want to be that contractor forever known for destroying a clients business.
> I am currently supporting four wineries that are processing thousands of tonnes of receivals 24x7... I don't want to be that contractor forever known for destroying a clients business.
If the 'clients' being referred to are the wineries, then it sounds like a self-run company. IE: the company is operating as a contractor to the winery clients, for whom (the wineries) a failure of the contractor (the company) would be a disaster for their business (the client, the wineries).
OTOH if "clients" refers to the business (the company) that in turn does the support for the wineries, then yes - an individual contractor.
The distinction would seem to me to really change the entire tenor of the comment. I'm curious which it is.
Would it make sense to hire someone? When you run a solo person business getting over the mental hump about hiring someone is difficult. If rewards are great and things are stable (16 years is stable) what is preventing you from hiring someone to help with at least some aspect?
I thought about hiring somebody five years ago, but the fact is my client has itself gone from a small/medium-sized business to a corporate, from no IT staff to thirty now, and they are working to replace the production systems I wrote by the middle of this year. To be frank, having that much responsibility as a one-man business is quite scary. Also, when I originally wrote the systems for them I had no clue it was going to get this busy and no clue I would be 24x7 support. And at 62 years old I really want to start winding down :-)
For "non-exempt" employees, that's paid "stand-by time" in California.[1] Also see this case involving on-call coroners.[2]
The way this works in most unionized jobs is that there's a stand-by rate paid for on-call hours, plus a minimum number of hours at full or overtime pay, usually four, when someone is called to duty. This is useful to management - if the call frequency is too high, it becomes cheaper to hire an additional person.
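To make the incentive concrete, here is a rough sketch of that pay structure. The rates, the 128 off-hours in a week (168 minus a 40-hour work week), and the function name are all made up for illustration, and for simplicity stand-by pay is not reduced for the hours actually spent on a call-out; only the four-hour minimum comes from the description above.

```python
def on_call_cost(standby_hours, callout_hours, standby_rate, call_rate, min_call_hours=4.0):
    """Cost of one on-call period under a stand-by rate plus minimum call-out pay.

    All rates are hypothetical; real agreements differ in the details.
    """
    standby_pay = standby_hours * standby_rate
    # Every call-out is billed for at least `min_call_hours`, however short it was.
    billed_hours = sum(max(hours, min_call_hours) for hours in callout_hours)
    return standby_pay + billed_hours * call_rate


# Hypothetical week: 128 off-hours on stand-by at $5/h, call-outs paid at $60/h.
quiet_week = on_call_cost(128, [], standby_rate=5.0, call_rate=60.0)
busy_week = on_call_cost(128, [0.5, 5.0, 1.0], standby_rate=5.0, call_rate=60.0)
print(quiet_week, busy_week)
```

With these invented numbers the two short call-outs are each billed as four hours, so a noisy week costs more than double a quiet one. That is the lever: past some call frequency, another hire is the cheaper option.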
Excellent article. I can relate to a lot of it. The sad part is that we can't even control the quality of the systems we're oncall for. We're pushed by management for new features, not for robustness of the tools. Also some systems have no clear ownership, so nobody has an incentive to fix them. It'll be next oncall's business. Oncall is really the worst part of my job. I can stand long hours but this is something else.
One of the sidebars mentions that: "The production system in question is almost certainly a schizophrenic box of compromises brought about through poor decision-making, unaddressed technical debt, design-by-committee, and impossible timelines and budgets. This is not a system that any single rational human being on the team would’ve chosen to build if permitted to do so alone. Trying to assert ownership over an environment like that is just begging to get your shit rocked."
That’s basically every company and every system ever. Things are always in a state of flux, constantly being worked on. People come and go, priorities change and technologies evolve.
> Also some systems have no clear ownership, so nobody has an incentive to fix them.
It’s even worse when the system isn’t business-critical: a reporting service, a manual intervention tool, something that quietly supports a process. When it fails, everyone is affected, but no one is accountable.
Ironically, these are often legacy systems that have been rock-solid for years — so reliable they’re forgotten… until they break.
I've been on call for almost 20 years. If a system is crashing often and disturbing your sleep but is not being prioritized to get fixed, then stop answering the calls. If it was important last night then it's still important this morning.
Be vocal and say "I will no longer respond when system XYZ goes down unless serious efforts are made to fix it."
If you get push back explain that you will also call the person telling you it's not important enough to fix each time it pages, and be willing to do so. What can they say?
If your case is valid you won't get fired for standing up for yourself. You might take some political damage but the guy willing to waste your sleep time was going to lay you off/betray you anyway.
I feel for you, I’ve also suffered through this a lot over the years, and am finally at the stage of career and wisdom to start pushing back on the quality that I can’t control and ensuring that others are equally as accountable for their mess.
On one particular occasion, once we took blame out of the equation (at least within the engineering team) and started doing Post Incident Reports, the incentives finally became clear for the business, as we were able to compile a list of recurrent issues from every incident, calculate a financial loss and present it for inspection each and every time they either began a witch hunt for downtime or refused to allocate time to backlog. Small wins.
"Work 5 shifts per week to monitor the NERSC HPC Facility, which includes 2 - 3 OWL (midnight - 8am) shifts. Some days may be onsite, some may be offsite. The schedule will be determined by staffing needs."
40 hours per week, full salary, full disclosure about the night shifts, but none of this 24x7 wake up in the middle of the night on top of your regular job bullshit that the tech industry insists on.
there's nothing wrong with shift work, and there's nothing wrong with the graveyard shift (for people who want that job), but there is a LOT wrong with alternating day and night shifts in the same week every week, and nobody should agree to do that. (perhaps if you are in your early twenties you feel like you can handle it, but I'm not sure that's a good idea either)
I tried graveyards and swing shifts in my twenties. I only did it for a year, but it wrecked my sleep schedule for years afterward; I struggled with insomnia because of it. Not sure there is any age where this works.
Some people can handle it better than others. I'll never do it again though.
some of the people I know who've made it work, and I'm not recommending this either, are "hardworking ambitious immigrant" types who do it to hold down two full time jobs. tends to be in industries where the overnight shift is a quiet shift rather than something like running the same factory at full tilt overnight.
My natural rhythm makes me a nocturnal being, while the society works 9 to 5. Once I realized this, it became obvious why I feel like fuck all the time. Can't wait to retire.
Shift work sucks more than on call for me. I like having a flexible schedule that I can work whenever I want, and I'll happily take the remote risk that I'll get paged during my one week per quarter on call rotation in order to guarantee that no one expects me to be on during set hours the rest of the year.
I think the problem isn't on call itself, it's that a lot of companies suck at on call. If your rotation is every 3 weeks and you get woken up in the night at least once per rotation, then yeah, that's awful. But the problem you have in that case is that stuff is always on fire and you don't have enough people on the rotation, not with on-call as a concept.
I'd take a night-shift job over being on call on top of a 9-5.
I think the real answer here is to have team members on multiple continents. Have some team members in NA/EU/Asia; then it's always reasonable hours for someone to deal with the production problem. High priority issues can be worked on around the clock without anyone working overtime.
That's still shift work. It's still the company assuming that I'll be available during N specific hours of the workday in order to fix issues.
Look at it this way:
If I work a job where I'm expected to be on 9-5, 46 weeks per year, 40 hours each week, that amounts to 1840 hours of scheduling my life around my employer.
If I work a job where I can schedule my work however I like and also have on call 1 week out of every quarter, the worst case scenario there is 672 hours of scheduling my life around my employer (and in practice the demands of on call in my current rotation are far less than that). The rest of my life I can schedule as I please, so long as I do my job.
I would rather take the option that minimizes the number of hours where my employer gets to tell me where to be.
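For anyone who wants to check the arithmetic in that comparison, a quick sketch (the variable names are mine; the figures are the ones given above, with the on-call worst case counting every hour of the on-call week):

```python
HOURS_PER_WEEK = 24 * 7  # 168

# Fixed 9-5 schedule: 46 working weeks of 40 hours each.
fixed_schedule_hours = 46 * 40

# Flexible schedule with on-call one week per quarter, so 4 weeks a year.
# Worst case: every hour of those weeks is constrained by the employer.
on_call_weeks = 4
on_call_worst_case_hours = on_call_weeks * HOURS_PER_WEEK

print(fixed_schedule_hours, on_call_worst_case_hours)
```

So even in the worst case the on-call arrangement ties up a bit over a third of the hours that the fixed schedule does.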
I know plenty of friends (mostly medical) who've had "on call" shifts on top of their 9 to 5, but in those cases it was pretty exceptional for them to actually receive a call, and the "interruptions" would be a small number of hours.
Here's an idea: Compensate any on-call work received during off hours at 10X the normal hourly rate. E.g., if my salary is $150K per year, then my hourly pay rate is about $75 per hour, so compensate my on call work at a rate of $750 per hour. Thus if I get a call at 10pm, log in to my laptop and work for 30 minutes to resolve the issue to a satisfactory level, then I pocket $375. That puts a financial incentive on companies to structure their on call protocols so that only the most important calls are handled. And I can envision variations on this theme. Different sorts of on-call disasters could offer bids for how much they're worth to fix based on some automated rubric, and anyone on the ENG team could pick these up on a first-come, first-served basis. Or various combinations of the above for a guaranteed backup person. But the companies should offer enough incentive to make it worthwhile. And this is in the companies' own best interest. To maintain a workforce that can think clearly during the normal work, to have a good reputation in the industry, to get good reviews on Glassdoor, etc.
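The numbers in that proposal work out if you assume roughly 2,000 paid hours per year (50 weeks of 40 hours), which is my assumption and not something stated above. A minimal sketch, with hypothetical function names:

```python
WORK_HOURS_PER_YEAR = 2000  # assumption: 50 weeks x 40 hours


def hourly_rate(annual_salary):
    """Approximate normal hourly rate implied by an annual salary."""
    return annual_salary / WORK_HOURS_PER_YEAR


def incident_pay(annual_salary, minutes_worked, multiplier=10):
    """Pay for one off-hours incident at `multiplier` times the normal rate."""
    return hourly_rate(annual_salary) * multiplier * minutes_worked / 60


print(hourly_rate(150_000))       # normal rate
print(incident_pay(150_000, 30))  # 30-minute call at 10x
```

Using 2,080 hours (52 weeks) instead would put the base rate nearer $72, so the "about $75" is a rounding, not an exact figure.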
Many systems that pay hourly for task-based work like this deal with this problem by instituting a minimum number of hours of pay per-instance, which is usually higher than the expected time it takes to complete a typical quick task.
That way, by taking longer on any but the hardest issues, you are instead removing your ability to make more money on other, faster issues.
If you call out a master electrician to flip a circuit breaker, they are going to charge you a lot more money than for the half second it took to flip the switch.
Also, if the reason they have to come flip that switch is that they screwed up the job they did earlier that week, you don't get charged at all.
This thread is full of people acting like highly experienced trade workers are idiots who have never thought of how hourly work might be gamed for more money. All of this has been long since solved by the industries that actually operate this way.
> Many systems that pay hourly for task-based work like this deal with this problem by instituting a minimum number of hours of pay per-instance, which is usually higher than the expected time it takes to complete a typical quick task.
That's how it works for the on-the-side server/printer tech jobs I occasionally take (long story short: I took a temp IT job years ago, resigned to go somewhere else, but the company never took me off their payroll so I get the occasional call to go install some number of printers or some number of servers/switches/etc. for some customer of HP or Dell, respectively). The usual rates are pretty abysmal for someone of my experience and skill level, but the 4-hour minimum means that if I can bang out one of these jobs in an hour or less I'm making more per-hour than at my day job. Nice bit of occasional money to blow on craps or penny stocks or shitcoins or whatever, and it keeps my fingers on various industry pulses.
Have a minimum of say, 60 minutes, and if that is exceeded, the issue gets escalated or deferred. If deferred, presumably to the next day shift, the cost is limited. If escalated, the second person must also defend the time spent. If management still doesn't trust their workers to be honest, then the company has other issues that tweaking on-call will not solve.
> "If on-call engineers were to receive compensation for each incident they resolved, it would incentivize them to intentionally build systems that fail so they could increase their pay by increasing their on-call load."
My guy, that is sabotage and fraud. You are hypothesizing a scenario where your subordinates are committing actual crimes. If somebody is doing criminal acts at work, fire their ass! Not to mention that anybody who deliberately self-inflicts on-call load is a goddamn idiot and should be sacked just on that basis alone.
It's a matter of degree.
Sabotage doesn't always mean instant breakage.
In "The Simple Sabotage Field Manual" we learn to slow down the organization by things like "Bring up irrelevant issues as frequently as possible", "Work slowly. Think out ways to increase the number of movements necessary on your job.", and "Do your work poorly and blame it on bad tools, machinery, or equipment."
If you are the person who consistently takes 10x longer than everyone else to fix issues, then someone has a conversation with you. It's not that hard.
Good suggestion and I can see the benefit for honest people, but unfortunately it’s equally a system that others can abuse financially, sometimes enough to stop people fixing things during their regular hours just so they can benefit at other times.
A good counter balance to this might be to offer even more compensation for no incidents, or otherwise well handled incidents that go on to squash types of that incident now and into the future.
Not necessarily, but agree that bad managers or organisations have a tendency to do this.
I guess the ultimate goal is to keep everyone happy right? Everyone has different ideals, you can probably assume the business wants everything to work by spending as little as possible, employees want to be paid as much as possible while enjoying their job. Striking that balance is always a challenge.
Some long haul truckers have a relatively workable pay scale. If it wasn’t for how far they expect you to drive in 24 hours, it might be considered good.
The general shape of it at least makes sense. You get paid when you’re on the road. More if you’re driving than if you’re parked, and more if you drive for more than your 40 hours a week.
It is also in my interest to skip spotting bugs during code review so I can look like a genius when I fix them when they cause issues in production. Of course I don't do that because this is incredibly stupid and I have better ways to spend my time.
10x pay is significant. Personally, I would remediate asap even with $1500/hour on the line. I do have ethical standards.
But would I be as motivated to stop the root cause of the recurring issues that keeps giving me an extra $3k every shift? Well, I probably would, but my coworkers already need to be goaded into fixing root causes, and they're not getting paid extra!
In general, I think conflicts of interest must be handled carefully, and ideally avoided. Paying 10x wages when incidents happen is a clear conflict of interest.
That’s true for any hourly paid job. Employers can choose to fire those who don’t work efficiently enough. What they can’t do is not pay people for hours worked, and with tech it’s easy for them to tell how long you are logged in for, which guards against them underpaying you.
Having overtime pay that is a multiple of regular hourly rate is mandatory in many countries in Europe. Are you saying that European software tends to be more obfuscated? (answer: it is not).
Employees are also subject to the Working Time Directive in EU countries, which sets limits on the amount of overtime that is permitted in a week and in a month. Unfortunately in most countries it's full of loopholes.
One of the biggest problems with on-call rotations is that you're actually incentivized to do it poorly. Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do. You're never going to put your on-call work into your performance reviews; doing so might actively work against you. "I see that you spent time tuning alerts and updating the runbook. That's time you didn't spend on the actual tasks that were assigned to you."
If it's better to spend the least amount of time doing on-call work then the logical conclusion is that it's best to snooze as many alerts as possible until they either go away on their own or roll over past your rotation. Fixing the underlying problem might be worthwhile if it's something that you can fairly easily fix but if the on-call rotation is more than 2 people, the underlying problem is mathematically unlikely to be of your making and is it really a good idea to make a habit of fixing other people's broken code?
What's crazy is that I've never seen anyone with on call duties acting in this worst case bad faith manner. Companies basically abuse the work ethic of their employees because it's the cheapest possible way to check that box.
> Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do.
In my experience at least, if you're oncall during a sprint you would have less work assigned to you than otherwise (2 week sprint and 1 week you are oncall? 50% allocation), as the expectation is that you will spend that week responding to alerts, or investigating issues, or even improving alerting and dashboards and fixing bugs. If this does not happen, devs don't push for it, and management is completely blind to it, you have an organization issue. If leadership does not care about the problem it's time to jump ship ASAP.
But I've seen people stubbornly defending an alert on >60% CPU usage of their 1 CPU allocated kubernetes pods where there was no impact in p99.9 latency (which was measured and was the actual metric that mattered as agreed with the rest of the business and internal customers of the service). Or alerting on each single pod restart. That is self inflicted pain.
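The alternative is to page only on the metric the business actually agreed on. A toy sketch of that rule, where the `PodSample` type, the field names and the 250 ms threshold are all invented for illustration and not taken from any real monitoring tool:

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    cpu_utilization: float   # fraction of the pod's CPU allocation, 0.0 to 1.0
    p999_latency_ms: float   # the agreed user-facing metric
    restarts: int            # pod restarts in the sampling window


def should_page(sample: PodSample, latency_slo_ms: float) -> bool:
    """Page a human only when the agreed latency target is breached.

    CPU usage and restart counts are deliberately ignored here: they may be
    worth a dashboard, but on their own they say nothing about user impact.
    """
    return sample.p999_latency_ms > latency_slo_ms


# Hypothetical samples against a made-up 250 ms target.
busy_but_healthy = PodSample(cpu_utilization=0.85, p999_latency_ms=120.0, restarts=1)
idle_but_slow = PodSample(cpu_utilization=0.30, p999_latency_ms=400.0, restarts=0)
print(should_page(busy_but_healthy, 250.0), should_page(idle_but_slow, 250.0))
```

The busy pod at 85% CPU with one restart wakes nobody up, while the quiet pod that is actually slow does.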
That article made me shudder with echoes of having what we used to call “beeper madness” back in the 1990s. After a while of being on a roster of on call weeks, anything that beeped would make you reach for that pager on your belt.
As a kid the first few weeks were kind of exciting as it felt like you had been elevated to a new level of responsibility. Once that wore off it was obvious what a cage it was.
I refused a contract-to-hire role that was talking about pager duty.
I saw how they freaked out about things outside the team’s control during business hours. The first time someone called me after 11 pm I was going to get myself fired talking to them.
" insure against every possible thing that could ever go wrong, they would have to build a second studio on a separate part of the city’s electric grid, with redundant copies of all the equipment and broadcast content, along with a full crew of understudies ready to take over at a moment’s notice."
WTH?? I guess this person has never heard of backup generators? Every broadcast TV station has them.
To begin with, airplanes do fall out of the sky, sometimes right on your backup generator. (speaking from military experience where yes, there is an entire backup studio waiting to take over, just in case. Or rather, the 'studio' is geographically dispersed, with 100% redundancy, which is another way of saying the same thing.)
But more importantly, *this* is what you noticed from that article?
No, but this was the point where the "let's make something up" got to be too much.
I'm not talking about low budget UHF channels, but TV stations I've been in and around all have multiple studios. If the switcher in Studio-A goes down, the signals can be routed to the control room for Studio-B. Also, Alex the know-it-all is such a forced thing that is just ridiculous and eye roll inducing. Anybody that is a jack of that many trades is a master of none of them. The entire forced analogy just got to be too much and I lost interest before a point was ever made.
Going to a remote is not the same thing as backup/redundancy within the studio. The broadcast was never interrupted. The latency between remotes can be mindnumbing, and with inexperienced reporter/anchor stepping over each other unable to sit through the delay it's pain inducing for the viewer. But that's an unrelated tangent
PTSD. I was that guy dragging his backpack everywhere. A drink at a bar on Saturday? Sunday lunch with the in-laws? With my backpack, ready to bust out the laptop in case of emergency.
My two personal lows... I had to pull out the laptop at my own birthday party, in a restaurant. And at a funeral (not proud of that one).
That seems like a very long-winded way to say you hate on-call, which is a completely normal thing to do. That said, is on-call effectively mandatory or very popular in the US startup world? Because here, in the European established company world, I can’t really recall seeing a job posting with on-call listed.
In Finland you usually get a static pay just for being on call, meaning you're X minutes away from a company-approved device you can use to fix things if they break.
If there's an actual alert you need to respond to, your pay goes to 2x hourly rate, or 3-4x if it's the weekend and there usually is a minimum amount of billing you do if you have to do any work, usually 30-60 minutes. So if you get an alert and you fix it by pressing the "fix it" button on the dashboard on a Sunday, you just got paid hundreds of euros.
On the other hand you most likely saved the company from losing multiple times that in revenue, so it's worth it to the company.
I have a bunch of friends who were single and/or child free in their 20s and have fully paid apartments/houses because they could be on call at any time because they didn't have any commitments.
I also know a good amount of incidents that were fixed in a pub corner table after a few drinks. I may or may not have contributed to that number. =)
I have yet to be on an oncall that is similar to what was described there, and I get paid pretty well. It's very much a function of where you work and what you work on.
Curious how this is the case. Do European companies not provide 24/7 services? Or staff a "follow the sun" model so no one has to answer pages outside of working hours? Do Europeans write better code so they don't need on-call?
I've worked at a few FAANGs in the US, and every single one after 2010 had an on-call rotation.
We totally have on-call, but we're also weaponizing German labor laws to force the company to have their shit together. There are a few interesting parts in there that cause quite the discomfort for employers:
The way contracts are worded, time working on on-call is work-time. Kinda obvious if you write it like that. As such, bad on-call weeks easily cut into the normal duties of the employee. This means team leads have an incentive to reduce time wasted on on-call.
You have mandatory rest-times. If an on-call activity takes an hour or so to fix, the person is suddenly not allowed to work for 10 hours due to these rest-time and maximum work time laws. Suddenly, "some little fixing at night" means the person isn't allowed to work the whole morning.
With a few rules like that, pages become really painful to the company. When a bad application kept pinging on-call every night for a few days, the entire normal work ground to a halt with people being unavailable, other team members dropping project work from sprints to pick up daily business slack. Some product managers got really pissed off and things in that product improved - I'm kinda curious what happened behind closed doors there.
It's actually 11h of mandatory downtime "between shifts"; this does indeed provide for theoretical opportunity to get good sleep for people with a short enough commute.
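A quick worked example of how a short night-time fix eats the next morning, as a sketch that assumes the 11-hour figure quoted above and a deliberately simplified rule (rest counts from the end of the incident). The function name is hypothetical and this is an illustration of the arithmetic, not legal guidance.

```python
from datetime import datetime, timedelta

# Rest period as quoted in the thread; the real rules have more detail.
REST_PERIOD = timedelta(hours=11)


def earliest_next_start(incident_end, planned_start):
    """Return when the engineer may start work again.

    Whichever is later: the planned start of the working day, or the
    end of the incident plus the mandatory rest period.
    """
    return max(planned_start, incident_end + REST_PERIOD)


# A page resolved at 03:00 pushes a planned 09:00 start to 14:00.
paged = earliest_next_start(
    incident_end=datetime(2024, 3, 5, 3, 0),
    planned_start=datetime(2024, 3, 5, 9, 0),
)
print(paged)
```

Under these assumptions a one-hour fix in the small hours costs the team five hours of the following working day, which is the incentive the comment above describes.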
1. Uncompensated oncall is legally tricky in many EU countries, so a lot of midsize companies look at the cost of paying for oncall, or sometimes just the time of administering paying for oncall, and decide they can do without. High frequency oncall is also often restricted (e.g. more than 1 week in 6 is not legal here)
2. A lot of the smaller companies are europeans selling to europeans, and are much more used to a business culture of availability during office hours. Especially there's a bigger share of like b2b back office stuff in europe compared to like, restaurant POS systems.
3. Larger companies do seem more into follow the sun. A lot of the big tech in europe are subsidiaries of US companies, so if they're in Europe it means they've already opened one remote location, and therefore are more likely to have another (California, Europe, India is a super common arrangement)
Never had on call in 25 years with American startups. Surprised that FAANGs have on call, yet from a customer perspective they offer no support, or limited support that may take days to respond.
Had some level of "on-call" in the sense that "there's a bug with your new/recent code and customer needs a fix" or "the demo isn't working and we present tomorrow" or "checkin deadline is in 2 weeks and my code isn't ready" pressure in 20+ years in silicon valley working at a startup and 2 medium size companies. The startup had a bit more since I knew lots of sysadmin stuff so I could help at times when our main IT guy wasn't available - but he was quite good and didn't let me have root access anyway, so it's not like I could have fixed it on those rare occasions.
Mind you, I interviewed for a Yahoo Mail job that would have included wearing a pager (back in '05). And I know it's pretty common - I've been fortunate to not have that be an issue. Hoping for another ~5 years at the current job until I can retire.
There's a great option for small companies that aren't amenable to on-call that is underappreciated. Hire an MSP. There are companies whose entire business model is having a geo distributed team with a stack of automated monitoring and run books for multiple clients. You train them, pay a set fee and never ask anyone on your team to be on call.
I’m quitting my job with nothing lined up because our oncall is such a piece of shit. House arrest every 5 weeks with no compensation. If it wasn’t for this I would just quiet quit but there’s no way I can make it through another shift without getting fired. Fuck oncall and anyone who has such little respect they think it’s ok
This article gave me unpleasant flashbacks to the first half of 2023. I resigned from planet.com in mid 2023 due to the stress caused by being on-call every second week. It took me six months to get my head into a healthy state again. Now I have a much better job, better paid and no possibility of on-call, ever.
The difference with dev oncall vs doctor on call is that it is self inflicted.
Why are you getting paged? Because you built the system.
Either your system isn't resilient enough, or you have noisy alerts. Both are problems you should be motivated to fix.
I have been on call 24/7/52 in SRE roles most of my career. It has either sucked hard, or not at all. And the time it sucked the most was because every single practice was bad. And now, I build better things because of it. Paying me more for on call wouldn't have changed how much it sucked. It wouldn't have made any material impact on my actual quality of life. But it would have done two things:
1) made me feel like I can't complain
2) given me less motivation to fix it
Paying for on call doesn't seem like a win. I want happy employees, not disgruntled but silent ones.
> Why are you getting paged? Because you built the system.
There are at least two problems with this thinking. The main problem is it's not generally true. The system is created by the entire organization. The people who raise money and allocate capital, the people who set development policies and priorities, the people who design and assemble the components, the people who sell it to customers and negotiate service levels and the people who operate and maintain it all collectively built the system.
Another problem is that it encourages moral hazards. Not paying fair on-call compensation allows unethical managers and sales staff to reap short-term rewards and bonuses by oversubscribing customers, promising more than can be delivered and rushing things to market before they're ready.
I guess what you are saying is the problem is the company culture - from a technical operations point of view at least - sucks. And no one wants to, or can, put the effort into fixing it.
I normally see people in oncall threads complaining that "I got paged by an alert because of another system X", but at least in a big enough organization this should not happen, and it's an organizational failure. There should be an operations center on 24h/24h able to triage, escalate and evaluate, possibly not staffed only with L1 techs and given enough freedom to actually improve and automate. I know there are places where that is not true, and I ran away screaming from some in my career once I understood tech leadership had no understanding why it was needed.
But you would be surprised how much of the oncall pain is actually self inflicted by application teams themselves (some examples I encountered in the last year: TCP connect timeouts in the minutes and with no retries, no retry policies in general and things that should be idempotent that are not, no circuit breaker strategies, connection pools churning as they're shared between 10+ remote endpoints, wrong expectations about transaction isolation levels and how to handle conflicts at least in simple scenarios).
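As an illustration of the "short timeouts plus a retry policy" items in that list, here is a minimal sketch of retrying an idempotent operation with jittered exponential backoff. The helper name and default values are assumptions chosen for the example; it only makes sense for operations that are safe to repeat.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run an idempotent operation, retrying on connection errors.

    Retries on ConnectionError and TimeoutError, waiting a random
    delay of up to base_delay * 2**attempt between tries so that many
    clients do not retry in lockstep. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Usage with a real socket would pair this with a connect timeout in
# seconds rather than minutes, for example:
#   call_with_retries(lambda: socket.create_connection((host, port), timeout=2))
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting.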
> I guess what you are saying is the problem is the company culture ... sucks.
I believe the problem is the way devops is often practiced. I've worked as a developer, a manager and an operator and I've occasionally carried a pager. I think there is value in rotating between those roles at different times since it enables engineers to gain knowledge and insight they often won't get any other way. But assigning engineers to after-hours on-call duties when they're simultaneously responsible for product development "because they built the system" is just a stupid unethical and unsustainable practice that needs to end.
Good companies hire and train engineers to develop, manage and operate systems sustainably.
This works if you're on call for your systems. In many situations (ranging from small startups to big tech), you're also on-call for the systems of sister teams.
Not that there aren't other ways to fix that. But fixing the erroring service isn't practical in all cases.
24x7 on-call must not be a reality imho. I know it is the reality but it should not be propagated and we should not even begin to try and somehow normalise it.
Can't this be simpler? If your system needs to be working at night, and it pays (if it doesn't then what are we doing at night?), then you need to hire someone to look after it specifically at night (if possible from a geography where it is not night when it is others' nighttime.. and so on)? i.e. follow the sun.
I think paid oncall could work if oncall is voluntary. The more oncall sucks, the less likely team members are to volunteer because the pay isn't worth it, then the company/team needs to pay more as an incentive to get people to volunteer for oncall. Eventually the price is so high that it becomes cheaper to just build the system correctly and stop shoehorning features in with no regard for stability.
If oncall burden is light, then everyone volunteers because it is an easy way to make a bit of money.
However, it is a huge systemic change to move towards a voluntary model. Not sure how feasible this really is.
"Well, 15 is the minimum ok? Now it's up to you if you just want to do the bare minimum, or uh..well look at Brian for example, he volunteered for 37 hours of on call"
>Why are you getting paged? Because you built the system.
I want to know which company you've worked at where that's even remotely true. There are always financial and time constraints that force you into trading off system resiliency for actually putting out a product.
We have integrations with dozens of external vendor APIs for which it's essentially impossible to disambiguate ahead of time whether any given error might be on their end or ours.
Yeah, I can relate to people saying they nearly got PTSD, I sure did get it. Paging apps use seriously offensive alarm sounds. I hated every sound they had in the options. It made me instantly sick. Fuck that!
The (in)offensiveness of the sound has little to do with it. For quite a few years after I left my job as an SRE at google, this gentle little tune called "morning flower"[1] which I used as my pager alarm sound would give me a little jolt of adrenaline. Even now, it's a bit startling, but it's old enough now that at least I do not hear it in the wild anymore.
The right answer is, in my experience, something that starts soft and intermittent, and then gradually ramps up in loudness and annoyance. In bed, it's enough to have a harpsichord gently playing, but if I'm watching TV or out at a restaurant I might need it to eventually escalate to a klaxon to get my attention.
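The "start soft, then ramp up" idea can be sketched as a function of how long the page has gone unacknowledged. Everything here (names, the two-minute ramp, the starting volume) is a made-up illustration, not how any particular paging app works.

```python
def alert_volume(seconds_unacknowledged, start=0.1, ramp_seconds=120.0):
    """Volume between `start` and 1.0, rising linearly over the ramp."""
    progress = min(max(seconds_unacknowledged / ramp_seconds, 0.0), 1.0)
    return start + (1.0 - start) * progress


def alert_sound(seconds_unacknowledged, ramp_seconds=120.0):
    """Gentle tone while ramping, harsh tone once the ramp is exhausted."""
    if seconds_unacknowledged >= ramp_seconds:
        return "klaxon"
    return "harpsichord"


# Volume and sound at a few points after the page fires.
schedule = [(t, round(alert_volume(t), 2), alert_sound(t)) for t in (0, 60, 120)]
```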
> I suffered through a particularly acute week of on-call pain. At one point I was in my third or fourth video call about the same long and protracted smoldering SEV and, in a moment of frustrated weakness, I made an offhand comment about just being tired of repeatedly handling the same problem. [...] With the utterance of a single sentence, I opened a rift in the relationship with my manager that remained until the day I left that job.
It's possible that the manager dropped the ball in two ways:
1. Perma-haranguing an employee, when manager should instead have had a talk with employee about what was bothering the employee. (And split it, between the little that really needs to be said right now, while employee is already overextended, and what can happen after they recharge.)
2. The root cause of the recurring problem might be something management should've tackled already. Which is additional reason not to blame the employee, for calling out the problem in frustration.
I was on call in a 4 man startup for a 1 week rotation for about 9 months, 6 years ago. I still have an anxiety reaction when my phone rings. Can very much relate to the author's thoughts about PTSD.
This is an excellent article. I've commented before¹ why "on-call" as described is bad because it conflates roles and responsibilities and robs developers of resources, but this goes quite a bit further and explains why it's a bad practice that eventually leads to burnout.
With the exception of companies like PagerDuty, the sooner this practice is ended the better off we'll be.
deferring to best practice instead of best judgement is a major plague of the software industry these days.
best practices usually come from giant companies with tens of thousands of engineers like google (who doesn't seem to be keeping up with competition btw) and amazon (which is notorious for burning out people).
what science or evidence drives the best practices?
(Former) iOS app developer here. I was oncall once and it was actually not that bad because every change had to go through app review, which put the lower bar on response times at days if not weeks. I hate app review but it was actually very nice that "oncall" really just meant "check the Slack channel in the morning" because there was no point in doing anything faster.
> My manager was present on the call, and my statement seemed to really set him off. I was essentially told that my feelings about the situation—perhaps the only authentic part of myself I ever expressed there—were wrong. In the days that followed I was made to feel like I was not a team player, that I was not pulling my weight, and that I was not meeting the bare minimum of what was expected of a person bearing the torch of on-call. With the utterance of a single sentence, I opened a rift in the relationship with my manager that remained until the day I left that job.
So, this is just plainly bad leadership, right? Totally believable too, of course, but just really bad. Bad for the employee, but also self-defeating for the manager.
It seems like this would be an awful manager reaction to anything short of a quasi-fireable offense. If that's your response to an employee not being enthusiastic about a part of work that sucks, what are you even doing as a leader?
In my country if you are on an employment agreement, you are just paid for overtime. My company never made any problems with that payment. And surprise...there were always people who chose to take the additional hours. I was often among them - the additional money/vacation was just too nice to resist.
OP is right, but apart from that, this is one of the best written pieces I've read in an age. Agree, disagree, but it's so well written it's mesmerizing.
This article form might be cathartic for the writer, and actionable recommendations aren't the main point. But they are sprinkled throughout, and a small management slide deck could be distilled from the piece.
I recall "Soul of a New Machine" having parts that were fatiguing to read, incidentally (or intentionally) conveying the miserable mood of the characters, slogging through the project's trials.
Counterpoint: I'm at a small business and I'm primary for 24x7 oncall. I don't even take shifts with my coworker. But, this is because I'm empowered to make out of hours (overnight, weekend, holiday) calls STOP. I get woken up by something about once or twice a quarter.
When something wakes me up, the next day I start a process to ensure I don't get that same alert again: bugfixes, adjusting thresholds or time-to-critical, detecting problems and auto-remediation, determining it can be a "business hours" response.
This also requires buy-in from development. Literally yesterday I had an education opportunity with one of the developers about a ticket slated to go into production that evening that would have immediately eliminated one of our leading monitoring indicators, because it would have started creating hundreds or thousands of Sentry issues an hour. "I was thinking it was more like logging, where more information is better, where with monitoring we want the fewest messages possible."
Always, always, look at every pager hit and ask "what can prevent that from happening again?"
I was on as many as three on-call rotations for a few years. One had only two people for a while, so I was on every other week. The two things I most remember are:
* Arranging my whole life around on-call requirements. Bringing my laptop and backpack every time I went out. Designing new running routes that would use every street in a neighborhood and keep me close to home so I could respond within a 15-minute window. And yeah, the drinking thing. It pervaded my life in many ways I hadn't expected.
* Time zones and geography. These were always problematic, but especially during on-call. Often I'd narrow a problem down to a particular component that I didn't know well, so I'd try to contact the sub-team responsible for it, but nobody would respond. Then I'd try to turn the right knobs myself, and as often as not get yelled at for it in the morning. No, my afternoon, because my coworkers were three hours behind and late-commuters to boot. Of course they'd never hesitate to schedule meetings or ping me for trivial things well after my dinner.
I had taken the job, initially working on a project for which I was already a maintainer, because I wanted to avoid becoming an "architecture astronaut" by getting closer to operational reality. Indeed, I did learn a lot about how my own code behaved in real life. I don't have a problem with on-call requirements in and of themselves, but the way people and organizations handle the details is kind of <vomit emoji>.
being oncall forces the quality of software to improve.
if you want fewer incidents: ensure better QA, monitoring, smaller rollouts
usually developers start becoming more conservative after they do a few oncall shifts and suddenly prioritize important reliability improvements, instead of shiny new features nobody will use
Being on-call forces the *desire* for the quality of software to improve. Shitty management can and will override that. We don't have time for QA or to waste an engineer adding monitoring, we gotta ship ship ship.
Only a manager could have such a distorted view. I'd love to work on robustness but product management has 5 years worth of feature JIRAs lined up for me.
need to bake some refactoring time into regular tickets. PM should only care about features, while software devs should provide reliable estimates on the velocity of sustainable software development
Ah, but when Alex can do 4 tickets a week baking refactoring, maintenance, tests, and observability into their work, and Blake can do 8 tickets a week focusing only on features, who do you think is going to get promoted?
These incentives then quickly devolve into a classic prisoner's dilemma. There's huge incentives to "defect" by producing quick-but-dirty work. You get the benefit of looking like you're producing rapidly, but you've made the collective experience a little bit worse.
hm... if the team is agile, then everyone does refactoring and it is the team lead's job to assign tickets and evaluate. Team lead should have enough context to compare apples to apples.
if your work improving the codebase is not valued, then it's probably time to change jobs or just stop caring about code sustainability - let the business accrue technical debt, which is sometimes a viable strategy if your runway and planning horizon is limited
I believe that too is as the author wrote – like a disheartening number of things in the tech industry, there are no real standards around what on-call responsibilities look like. Each organization is free to set things up in whichever way suits their tastes, and the resulting practices vary widely as a result.
Yes, makes perfect sense. I know when I want my horse to go faster, I don't entice it with more carrots, I just try to find better sticks to beat it with.
this doesn't always work. many things can go wrong in distributed systems and you cannot test for all of them. also you have no control of your dependencies like when AWS networking degrades or a 3rd party API provider changes their APIs without letting you know.
True, but these things happen very very rarely. Also:
1. Is there anything you can do about it?
No? Remove the alert, replace with a "we are down sorry" message.
Yes? Then automate that thing.
Rinse and repeat after every incident and you will eventually get paged rarely.
I think if you have a reasonable environment, where on-call feeds back to development (which is what OP is suggesting, more or less), you will absolutely get woken up for networking problems, because there's not really an alternative. Maybe some thresholding to allow for minor problems without alerting, but you know. If it's a big enough problem, someone has to fix it, and it doesn't matter if it's your problem or your dependency's problem, it breaks your service so it's your problem. If it happens a lot, you look for another network to run on.
For 3rd party APIs, if they're not critical, you start to develop kill switches. So yeah, someone has to wake up and handle it, but all they have to do is set the kill switch and go to sleep.
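A kill switch for a non-critical dependency can be very small. The sketch below is an assumption-laden illustration: the class, the feature name and the `fetch_recommendations` helper are all hypothetical, and a real switch would usually live in shared configuration so it can be flipped without a deploy.

```python
class KillSwitch:
    """In-memory registry of features that have been switched off."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


def fetch_recommendations(switch, call_vendor):
    """Call the vendor unless the feature is switched off.

    When the switch is set, skip the vendor entirely and return an
    empty result so the rest of the page still renders.
    """
    if not switch.is_enabled("recommendations"):
        return []
    return call_vendor()


switch = KillSwitch()
normal = fetch_recommendations(switch, lambda: ["a", "b"])

# 3 a.m.: the vendor is down, so set the switch and go back to sleep.
switch.disable("recommendations")
degraded = fetch_recommendations(switch, lambda: ["a", "b"])
```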
Personally, I did dev and on-call for SMS/Voice verification codes. Most of the time, that's in the nasty corner of it's super critical to the application (users can't use the product if they can't get a verification code) and it depends on 3rd parties that have three nines on a good year. In my case, I got tired of dealing with the disruption and developed automated routing that could manage most providers taking an outage without needing me to take action. Results could be better if a human took notice and action, but it was good enough most of the time, and partial outages were much easier for the automation to handle.
Even if there's no way to do something like that, at least automation can take care of a 3rd party API that is failing hard: mostly return errors quickly without trying the API, and only let a small fraction of requests go through to sample whether the service has come back online. That can keep your servers from getting overwhelmed, as well as drive the alert that helps you wake up, yell at the vendor, and decide if you can go back to sleep while the system takes care of itself.
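The fail-fast-and-sample behaviour described here is essentially a circuit breaker. Below is a minimal sketch under stated assumptions: the class name, the threshold of consecutive failures and the 5% probe fraction are invented for illustration, and it is single-threaded with no time-based reset.

```python
import random


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without trying the vendor."""


class SamplingBreaker:
    def __init__(self, failure_threshold=5, probe_fraction=0.05, rng=random.random):
        self.failure_threshold = failure_threshold
        self.probe_fraction = probe_fraction
        self.rng = rng  # injectable so behaviour can be made deterministic
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, operation):
        # While open, reject most calls immediately and let only a small
        # random fraction through as probes.
        if self.is_open and self.rng() >= self.probe_fraction:
            raise CircuitOpenError("vendor marked down, failing fast")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success (including a probe) closes the breaker again.
        self.consecutive_failures = 0
        return result
```

The `is_open` flag is also a natural thing to alert on, since it says "the vendor is failing hard" without one page per failed request.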
When on-call is disconnected from development, that's when it gets really miserable. If you can swing a shift-work/follow the sun operator job, that's certainly better than on-call where incidents are common and there's no feedback loop to reduce things. It may well be better even if there is a feedback loop, but the feedback loop in that case requires explicit communication and effort; if I'm on call for my own work, I don't have to tell me to not push shit code right before I leave for the day or go to sleep, I'll get that message from myself right away. If someone else is cleaning up after my messes and doesn't communicate the effects to me, I might never know.
you are right, if your software is suffering regularly from the same issues - you go and fix them at the architecture level.
network issues? use a second DC, or some HA SDN setup.
3rd party API issues? Change vendor, or send stuff to a queue to reprocess later. All of these issues could and should be solved and that's the job of the developer
There is another option. The on-call person just does a deliberately piss-poor job of resolving the problem. I mean, they resolve it but they make sure it takes an hour longer than necessary.
What are they going to do, fire you? If they make life hard for you, then get another job. The shoddier your work outside of your normal hours, the better. You can have quality, speed and cheapness, but you can only pick two.
I think there's also some middle ground where you don't go out of your way to carry a laptop but you do best effort while maintaining a normal outside-work life.
at a prev US tech job, a few years in they made all engineers have oncall, with no compensation added whatsoever, and on call for other teams' code in India.
i got paged a few times and didn't exactly rush to ack. not slow but didn't rush. the rotations increased more and more (mostly from other US employees quitting) till i quit. scummy company
What they don't tell you about working for yourself is the fact you can be effectively on-call 24x7 every day. I am currently supporting four wineries that are processing thousands of tonnes of receivals 24x7. It happens for two months of the year and I am expected to be available from 06:00 to 22:00 during that time, there is no phoning in sick or having a lazy day, I work alone and only have one reputation. I don't want to be that contractor forever known for destroying a client's business.
You can only do this for so long though, when two or three problems come in simultaneously it can cause issues as you drop something halfway through when something more important comes in. I once executed an SQL update query without a where clause under this kind of pressure, and ended up working until the next morning to recover, only to start again at 6AM. I have even had land-line calls at 2AM to bypass my mobile restrictions. The rewards are great, but don't let anyone tell you it is always easy.
My current system is 16 years old now and I know all the ins and outs so it has been pretty easy to keep on top of things the last several years, however I am glad the replacement system is nearly written and it will be somebody else's problem in 2026.
There's a big difference, though.
In your case, you're the only employee of your business. And if you're not there the business will literally go under. And you also get directly rewarded for being there. I would guess that being 'on call' in this manner is possibly less draining on a person's soul (depending on how well they tolerate the risk of owning a business).
Contrast that with being 'on call' for your megacorporation, who isn't giving you anything extra for your on call time because they 'already pay you enough'. And where the only negative consequence for the company if you fail to immediately respond within 15 minutes is that some executive in the company is kept waiting longer than 15 minutes, or some ads aren't being shown for 15 minutes.
But if you aren't there, your boss is going to get a phone call and that's definitely not going to look good on you. And there's no bonus for fixing the problem, that was already your job in the first place. Sucks that you had to do it outside of scheduled hours, oh well.
I'm with the author of this article. Take your on-call rotation and shove it (if you're a large corporation). I'm fortunate enough to be able to take a firm stance on this point, and do so happily.
> And you also get directly rewarded for being there. ... Contrast that with being 'on call' for your megacorporation, who isn't giving you anything extra for your on call time because they 'already pay you enough'.
I'm really not seeing the distinction here. If a company offers a salary and includes on-call as part of the deal (and communicates that up front so it's not a bait and switch), how is that different than running your own business and getting compensated for your on-call time as part of a package that you sell to a client? In both cases you agree up front that you will be part of an on-call plan. In neither case are you getting a bonus for doing a good job at on-call, because either way you're just doing your job that you committed to ahead of time!
I'm totally sympathetic to people who don't want to be part of an on-call. Jobs that have on-call aren't for you, that's fine. But I don't get this idea that it's uncompensated labor, unless there are tons of people out there who somehow ended up in jobs that sprung on-call on them without warning.
The difference is agency.
Let's say I'm a business owner and I'm frustrated with the current state of the on-call system. I have options.
I can try negotiating with my clients to lessen the load in some way. Obviously this isn't always possible but it often is. I once had a freelance project that required 24 hours of on-call after a release. I negotiated release days that were convenient for me (never Fri/Sat/Sun). One time the client pushed back, I pushed back harder, and I won. For my pushback to work I had to make sure I had enough negotiating strength, which I planned for ahead of time.
I can upgrade my systems. For example if my current paging system is insufficient I can choose to pay $10/month more for another system that makes my life easier. I can set aside time to refactor my alerts code to make my life easier and I don't need to justify it to anyone but myself.
I can straight up refuse to do on-call and deal with the consequences to my business. Freelance developers do this all of the time. We choose which client work to do and not to do. We can make these choices arbitrarily. Sometimes it's seasonal. Sometimes it's just based on vibes. Doesn't matter; it's our company.
Meanwhile the average on-call engineer at a large company has none of these freedoms. The underlying systems are chosen for them and they just have to deal with it.
This story is from an Indian company; the on-call "expectations" might be different in your country. We were responsible for a 400k RPM service. It handled ads, so it was fairly important to the business. Whenever I had to go out for a night, or go out for a family event, or whatever, I was always able to hand over on-call to a team member for that duration. Of course, this also happened during others' on-call, where I would take over. In fact, this happened daily! From ~7pm to ~9pm every day I would play football or whatever. I would always hand over on-call to another team member during this time. I usually wake up earlier than others, so I used to respond to alerts during those hours regardless of who was on call. On the nights when I was staying up to watch Champions League football matches, or for some other reason, I would take on-call as well. We just set up PagerDuty's escalation order appropriately. It probably helped that there were just 5 of us in the team: easy coordination. Of course, this was my first job, and I messed up quite a bit, but I noticed the others following a similar system without me as well.
It also depends on the nature of the alerts I suppose. For us, the majority of the alerts could be checked and resolved from a mobile phone (they are alerts that could strictly speaking be resolved in an automated fashion, but the automation would get complex enough with dependencies on other services' metrics that we wouldn't be sure of not having bugs in _that_ code). About once every week or two we would get an alert that needed us to look at the logs and so on.
> Meanwhile the average on-call engineer at a large company has none of these freedoms. The underlying systems are chosen for them and they just have to deal with it.
In most cases they have all of those freedoms, and the only barrier is one that's shared with the self-employed person: not liking the consequences of choosing those options.
They could negotiate with their manager to lessen the load. They could upgrade the systems. They could straight-up refuse on call.
They don't because they don't like the consequences of taking these options—and neither does the self-employed person!
> not liking the consequences of choosing those options
Correct. "Shove it" is usually preceded by not liking something.
> They could negotiate with their manager to lessen the load.
Most of the time the manager will simply refuse. As a business owner it's my decision.
> They could upgrade the systems.
At big companies this is usually outside the scope of an on-call engineer. The on-call engineer often doesn't even have commit rights to that repository.
The specific example I gave was paying $10/month more. That can be a very hard sell at a large company because their service contracts are much more complicated/expensive.
> They could straight-up refuse on call.
A business owner has much more negotiating power than an employee does.
> They don't because they don't like the consequences of taking these options—and neither does the self-employed person!
In the vast majority of cases making changes to the on-call infrastructure has very little (if any) measurable impact on the business. Like spending a week making the systems better. Or changing deploy/release dates to be more convenient.
As a business owner I can take advantage of this and make my life easier.
As an employee I have layers of bureaucracy to wade through and will probably be refused. Not because it affects the business but for other reasons.
That's the difference.
Do others not generally get extra pay for the time on call?
I have the enviable situation where I am on call for half of every month, I get paid significantly extra for this, and there's maybe 1 emergency call per year.
My last several jobs the extra pay has been a small phone stipend, and perhaps a very small token sum (maybe $50 for the week).
The only time I made significant money on call was early in my career as a contractor.
Yeesh, I get more than that per day. I didn't realize others had it so bad.
Nope. Amazon, for instance, has their engineers on call in a variety of roles with no additional pay.
Salaried tech employees do not get extra pay for being on call, generally.
The big difference is as an owner you are fully in control of allocating your time, and so if out of hours workload is becoming too much, you can choose to not work on other things in favour of fixing that. In the corporate world, there's some manager who weighs up spending two weeks to properly fix an issue or automate a process vs just making their workers unhappy and doing something that will make the manager look good in the internal politics, and often will insist on the latter.
But that's the problem with being in a bad company no matter what. If it weren't on call it would be something else that that manager would be making you suffer through. That doesn't mean on-call as a concept is terrible or that it's uncompensated labor, it just means that bad managers are capable of screwing up your life with any tools that you give them.
Bad managers are hard to predict. Even if you like your manager now, there's nothing to say they can't be promoted/reassigned/quit and you get a new manager that sucks. On-call being a thing is easier to get an answer on ahead of time.
But it's meaningless as a signal for bad managers because basically everyone does on call in some form. That's what this whole discussion is about: how ubiquitous it is.
The only job I worked that didn't have a formal on-call rotation ended up with me unofficially on call, with the same expectations as though that had been set up up front: boss calls whenever he calls and expects an answer, and I'm left deciding how badly I want this job. Turns out management there sucked and I ended up on call after all.
If you find a company that actually has a good story for why they're able to get away with no on call, that might be a good signal. But if they're out there I'd love to hear from them, because most people here are just speculating about better alternatives, not speaking from experience with ones that actually worked.
At Amazon it’s common to have terrible, week-long oncall shifts with many repetitive pages. At Google they have shifts that follow the sun and they get PTO to compensate for being oncall during the weekend. Both jobs pay similarly. And I think most people joining Amazon don’t know about the oncall they are in for.
> most people joining Amazon don’t know about the oncall they are in for.
In general people joining Amazon should be aware that they're joining a company that is fiercely and proudly opposed to work-life balance.
This ultra-libertarian take is consistent but not realistic. Realistically, the group decides that some amount of work or sacrifice is not compatible with having a good life, so laws are passed that either disallow such extremes or mandate extreme compensation.
This goes back to Shabbat/Sabbath.
My friend worked for Amazon in India (software). He was often on 24x7 "on calls", which is touted as "good" here (because how else will you "learn"), during his 3rd or 4th week. By the third night he was vomiting and had to visit the doctor. His manager called and asked whether he had brought his laptop with him. His mother forced him to resign the next day (he is from an extremely rich family, though). It is common here in most companies, and among famous MNCs it is especially known at Amazon and Uber in India.
What shocks more is these are the companies that can "follow the sun" w/o breaking a drop of fucking sweat!
I have lost too many interviews just because I clearly asked for this, I always do, and I am doing it even now while I have been without a job after taking a year gap (which makes getting calls already difficult esp. with this AI and vibe onslaught). I am not giving up on this. I personally have never agreed to this which has caused lots of confrontations and stress(!!); a major source of my burnout WITHOUT ever doing the 24x7 on-call - so by just fighting it and keeping it away from me alone I was burnt out to the bits. It took me finally seeking medical advice to realise I was burnt out.
I hope this doesn't sound dramatic, but even now, after a year of resting and travelling, the mere mention of words like Splunk, VictorOps (same as Splunk iirc), PagerDuty gives me a minor trigger-attack kind of sensation and makes me very agitated.
But this is so common here. So common that it is considered one of the realities, truths. Yet, I have never understood, how, how can one agree to this? How? Is it some kind of social (if not racial) slave mentality? Is it some kind of grand coercion that they have no escape from? Or maybe it's just generation after generation subjecting the next generation to what they were subjected to, while the stakeholders in the richer countries (because that is the structure) demand this implicitly, as they are stopped by health and safety laws from subjecting their underlings in their own developed home nations to it, maybe.
Working for yourself is totally different.
It's like demands from tech executives for long hours: "I worked long hours to make myself rich; why won't all of you work long hours to make me richer?"
I somehow do not think OP is working for themselves. They are a contractor there. I do believe contracting is just being an employee on slightly different terms.
> I work alone and only have one reputation. I don't want to be that contractor forever known for destroying a clients business.
Hard to say, interesting observation:
> I am currently supporting four wineries that are processing thousands of tonnes of receivals 24x7... I don't want to be that contractor forever known for destroying a clients business.
If the 'clients' being referred to are the wineries, then it sounds like a self-run company. I.e., the company is operating as a contractor to the winery clients, for whom (the wineries) a failure of the contractor (the company) would be a disaster for their business (the client, the wineries).
OTOH if "clients" refers to the business (the company) that in turn does the support for the wineries, then yes - an individual contractor.
The distinction would seem to me to really change the entire tenor of the comment. I'm curious which it is.
Would it make sense to hire someone? When you run a solo person business getting over the mental hump about hiring someone is difficult. If rewards are great and things are stable (16 years is stable) what is preventing you from hiring someone to help with at least some aspect?
"The E-Myth Revisited" calls this working on your business as opposed to working in your business. Otherwise you don't own a business, you own a job.
I thought about hiring somebody five years ago, but the fact is my client has itself gone from a small/medium-sized business to a corporate, from no IT staff to thirty now, and they are working to replace, by the middle of this year, the production systems I wrote. To be frank, having that much responsibility as a one-man business is quite scary. Also, when I originally wrote the systems for them I had no clue it was going to get this busy and no clue I would be 24x7 support. And at 62 years old I really want to start winding down :-)
For "non-exempt" employees, that's paid "stand-by time" in California.[1] Also see this case involving on-call coroners.[2]
The way this works in most unionized jobs is that there's a stand-by rate paid for on-call hours, plus a minimum number of hours at full or overtime pay, usually four, when someone is called to duty. This is useful to management - if the call frequency is too high, it becomes cheaper to hire an additional person.
[1] https://www.dir.ca.gov/dlse/CallBackAndStandbyTime.pdf
[2] https://casetext.com/case/berry-v-county-of-sonoma
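The "it becomes cheaper to hire an additional person" point can be put into numbers. This is only a rough sketch: the function names and every figure in the example (stand-by rate, call-out rate, cost of the extra hire) are made-up assumptions, and it simplifies by billing every call-out at exactly the four-hour minimum.

```python
def monthly_on_call_cost(callouts, standby_hours, standby_rate,
                         callout_rate, minimum_hours=4.0):
    """Stand-by pay for the on-call hours plus minimum-hours pay per call-out."""
    standby_cost = standby_hours * standby_rate
    callout_cost = callouts * minimum_hours * callout_rate
    return standby_cost + callout_cost


def break_even_callouts(extra_hire_monthly_cost, standby_hours, standby_rate,
                        callout_rate, minimum_hours=4.0):
    """Call-outs per month at which on-call pay equals the cost of an extra hire."""
    remaining_budget = extra_hire_monthly_cost - standby_hours * standby_rate
    return max(0.0, remaining_budget / (minimum_hours * callout_rate))


# Hypothetical numbers: 500 stand-by hours at $5/h, call-outs paid at $75/h,
# and an extra hire costing $10,000 per month.
print(break_even_callouts(10_000, 500, 5.0, 75.0))  # 25.0
```

Above that call-out count, under these assumptions, management is paying more for interrupted sleep than it would for another person on the rota.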
Excellent article. I can relate to a lot of it. The sad part is that we can't even control the quality of the systems we're oncall for. We're pushed by management for new features, not for robustness of the tools. Also some systems have no clear ownership, so nobody has an incentive to fix them. It'll be next oncall's business. Oncall is really the worst part of my job. I can stand long hours but this is something else.
One of the sidebars mentions that: "The production system in question is almost certainly a schizophrenic box of compromises brought about through poor decision-making, unaddressed technical debt, design-by-committee, and impossible timelines and budgets. This is not a system that any single rational human being on the team would’ve chosen to build if permitted to do so alone. Trying to assert ownership over an environment like that is just begging to get your shit rocked."
That’s basically every company and every system ever. Things are always in a state of flux, constantly being worked on. People come and go, priorities change and technologies evolve.
Exactly.
> Also some systems have no clear ownership, so nobody has an incentive to fix them.
It’s even worse when the system isn’t business-critical: a reporting service, a manual intervention tool, something that quietly supports a process. When it fails, everyone is affected, but no one is accountable.
Ironically, these are often legacy systems that have been rock-solid for years — so reliable they’re forgotten… until they break.
I've been on call for almost 20 years. If a system is crashing often and disturbing your sleep but is not being prioritized to get fixed, then stop answering the calls. If it was important last night then it's still important this morning.
Be vocal and say "I will no longer respond when system XYZ goes down unless serious efforts are made to fix it."
If you get pushback, explain that you will also call the person telling you it's not important enough to fix each time it pages, and be willing to do so. What can they say?
So your solution is just "Get fired"?
If your case is valid you won't get fired for standing up for yourself. You might take some political damage but the guy willing to waste your sleep time was going to lay you off/betray you anyway.
> You won't get fired for standing up for yourself.
Yes you will, or can. "At will" employment in the US.
I feel for you, I’ve also suffered through this a lot over the years, and am finally at the stage of career and wisdom to start pushing back on the quality that I can’t control and ensuring that others are equally as accountable for their mess.
On one particular occasion, once we took blame out of the equation (at least within the engineering team) and started doing Post Incident Reports, the incentives finally became clear for the business: we were able to compile a list of recurrent issues for every incident, calculate the financial loss, and present it for inspection each and every time they either began a witch hunt over downtime or refused to allocate time to the backlog. Small wins.
I just want to point out that the answer is shift work. Here's an example of an SRE job at a national lab:
https://lbl.referrals.selectminds.com/jobs/site-reliability-...
"Work 5 shifts per week to monitor the NERSC HPC Facility, which includes 2 - 3 OWL (midnight - 8am) shifts. Some days may be onsite, some may be offsite. The schedule will be determined by staffing needs."
40 hours per week, full salary, full disclosure about the night shifts, but none of this 24x7 wake up in the middle of the night on top of your regular job bullshit that the tech industry insists on.
there's nothing wrong with shift work, and there's nothing wrong with the graveyard shift (for people who want that job), but there is a LOT wrong with alternating day and night shifts in the same week every week, and nobody should agree to do that. (perhaps if you are in your early twenties you feel like you can handle it, but I'm not sure that's a good idea either)
I tried graveyards and swing shifts in my twenties. I only did it for a year, but it wrecked my sleep schedule for years afterward; I struggled with insomnia because of it. Not sure there is any age where this works.
Some people can handle it better than others. I'll never do it again though.
some of the people I know who've made it work, and I'm not recommending this either, are "hardworking ambitious immigrant" types who do it to hold down two full time jobs. tends to be in industries where the overnight shift is a quiet shift rather than something like running the same factory at full tilt overnight.
My natural rhythm makes me a nocturnal being, while the society works 9 to 5. Once I realized this, it became obvious why I feel like fuck all the time. Can't wait to retire.
Shift work sucks more than on call for me. I like having a flexible schedule that I can work whenever I want, and I'll happily take the remote risk that I'll get paged during my one week per quarter on call rotation in order to guarantee that no one expects me to be on during set hours the rest of the year.
I think the problem isn't on call itself, it's that a lot of companies suck at on call. If your rotation is every 3 weeks and you get woken up in the night at least once per rotation, then yeah, that's awful. But the problem you have in that case is that stuff is always on fire and you don't have enough people on the rotation, not with on-call as a concept.
I'd take a night-shift job over being on call on top of a 9-5.
I think the real answer here is to have team members on multiple continents. Have some team members NA/EU/Asia then it's always reasonable hours for someone to deal with the production problem. High priority issues can be worked on around the clock without anyone working overtime.
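As a rough sketch of what "follow the sun" routing could look like: the three UTC windows below are purely illustrative assumptions (real hand-off times depend on where the team actually sits), and `region_on_duty` is a hypothetical helper, not part of any paging product.

```python
# Hypothetical hand-off windows in UTC: (start hour inclusive, end hour exclusive, region).
SHIFT_WINDOWS_UTC = [
    (0, 8, "Asia"),
    (8, 16, "EU"),
    (16, 24, "NA"),
]


def region_on_duty(utc_hour: int) -> str:
    """Return the region whose working-hours window contains the given UTC hour."""
    if not 0 <= utc_hour < 24:
        raise ValueError("utc_hour must be in the range 0-23")
    for start, end, region in SHIFT_WINDOWS_UTC:
        if start <= utc_hour < end:
            return region
    raise RuntimeError("shift windows do not cover the full day")


print(region_on_duty(3))   # Asia
print(region_on_duty(12))  # EU
print(region_on_duty(21))  # NA
```

Every hour of the day maps to a region that is in its normal working window, so a page at 03:00 UTC goes to someone who is already at their desk.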
That's still shift work. It's still the company assuming that I'll be available during N specific hours of the workday in order to fix issues.
Look at it this way:
If I work a job where I'm expected to be on 9-5, 46 weeks per year, 40 hours each week, that amounts to 1840 hours of scheduling my life around my employer.
If I work a job where I can schedule my work however I like and also have on call 1 week out of every quarter, the worst case scenario there is 672 hours of scheduling my life around my employer (and in practice the demands of on call in my current rotation are far less than that). The rest of my life I can schedule as I please, so long as I do my job.
I would rather take the option that minimizes the number of hours where my employer gets to tell me where to be.
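For anyone who wants to check the arithmetic, here it is as a quick Python sketch. The function names are made up for illustration, and the on-call figure is the worst case described above: all 168 hours of each on-call week counted as constrained, one week per quarter.

```python
HOURS_IN_A_WEEK = 24 * 7  # 168


def fixed_schedule_hours(weeks_per_year: int = 46, hours_per_week: int = 40) -> int:
    """Hours per year spent on an employer-set 9-5 schedule."""
    return weeks_per_year * hours_per_week


def on_call_worst_case_hours(on_call_weeks_per_year: int = 4) -> int:
    """Worst case: every hour of every on-call week counts as constrained."""
    return on_call_weeks_per_year * HOURS_IN_A_WEEK


print(fixed_schedule_hours())      # 1840
print(on_call_worst_case_hours())  # 672
```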
I know plenty of friends (mostly medical) who've had "on call" shifts on top of their 9 to 5, but in those cases it was pretty exceptional for them to actually receive a call, and the "interruptions" would be a small number of hours.
Here's an idea: Compensate any on-call work received during off hours at 10X the normal hourly rate. E.g., if my salary is $150K per year, then my hourly pay rate is about $75 per hour, so compensate my on call work at a rate of $750 per hour. Thus if I get a call at 10pm, log in to my laptop and work for 30 minutes to resolve the issue to a satisfactory level, then I pocket $375. That puts a financial incentive on companies to structure their on call protocols so that only the most important calls are handled. And I can envision variations on this theme. Different sorts of on-call disasters could offer bids for how much they're worth to fix based on some automated rubric, and anyone on the ENG team could pick these up on a first-come, first-served basis. Or various combinations of the above for a guaranteed backup person. But the companies should offer enough incentive to make it worthwhile. And this is in the companies' own best interest: to maintain a workforce that can think clearly during normal work, to have a good reputation in the industry, to get good reviews on Glassdoor, etc.
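The numbers in this proposal work out as follows. A minimal sketch: the 2,000 working hours per year is my assumption (it is what makes $150K come out at $75 per hour), and the function names are hypothetical.

```python
def hourly_rate(annual_salary: float, work_hours_per_year: float = 2000) -> float:
    """Approximate hourly rate, assuming 2,000 paid working hours per year."""
    return annual_salary / work_hours_per_year


def off_hours_pay(annual_salary: float, minutes_worked: float,
                  multiplier: float = 10, work_hours_per_year: float = 2000) -> float:
    """Pay for off-hours incident work at a multiple of the normal hourly rate."""
    rate = hourly_rate(annual_salary, work_hours_per_year)
    return rate * multiplier * minutes_worked / 60


print(hourly_rate(150_000))        # 75.0
print(off_hours_pay(150_000, 30))  # 375.0
```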
Wouldn't that incentivise staff to take longer to fix issues? Once you've been interrupted, you might as well turn 30 minutes into 60 minutes, etc.
Many systems that pay hourly for task-based work like this deal with this problem by instituting a minimum number of hours of pay per-instance, which is usually higher than the expected time it takes to complete a typical quick task.
That way, by taking longer on any but the hardest issues, you are instead removing your ability to make more money on other, faster issues.
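A small sketch of why a minimum-hours rule blunts the incentive to drag things out. The four-hour minimum and the $50 hourly rate are assumed values for illustration only, and the function names are made up.

```python
def callout_pay(hours_worked: float, hourly_rate: float,
                minimum_hours: float = 4.0) -> float:
    """Pay for one call-out: actual hours worked, but never less than the minimum."""
    return max(hours_worked, minimum_hours) * hourly_rate


def effective_rate(hours_worked: float, hourly_rate: float,
                   minimum_hours: float = 4.0) -> float:
    """What the call-out actually paid per hour worked."""
    return callout_pay(hours_worked, hourly_rate, minimum_hours) / hours_worked


for hours in (0.5, 1, 2, 4, 5):
    print(hours, callout_pay(hours, 50.0), effective_rate(hours, 50.0))
```

Stretching a 30-minute fix to 2 hours pays exactly the same total, so the slow worker only lowers their own effective rate until they pass the minimum.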
If you call out a master electrician to flip a circuit breaker, they are going to charge you a lot more money than for the half second it took to flip the switch.
Also, if the reason they have to come flip that switch is that they screwed up the job they did earlier that week, you don't get charged at all.
This thread is full of people acting like highly experienced trade workers are idiots who have never thought of how hourly work might be gamed for more money. All of this has been long since solved by the industries that actually operate this way.
> Many systems that pay hourly for task-based work like this deal with this problem by instituting a minimum number of hours of pay per-instance, which is usually higher than the expected time it takes to complete a typical quick task.
That's how it works for the on-the-side server/printer tech jobs I occasionally take (long story short: I took a temp IT job years ago, resigned to go somewhere else, but the company never took me off their payroll so I get the occasional call to go install some number of printers or some number of servers/switches/etc. for some customer of HP or Dell, respectively). The usual rates are pretty abysmal for someone of my experience and skill level, but the 4-hour minimum means that if I can bang out one of these jobs in an hour or less I'm making more per-hour than at my day job. Nice bit of occasional money to blow on craps or penny stocks or shitcoins or whatever, and it keeps my fingers on various industry pulses.
Have a minimum of say, 60 minutes, and if that is exceeded, the issue gets escalated or deferred. If deferred, presumably to the next day shift, the cost is limited. If escalated, the second person must also defend the time spent. If management still doesn't trust their workers to be honest, then the company has other issues that tweaking on-call will not solve.
This will just lead to the classic Cobra Effect, where people just push shit on Friday afternoon and then take their sweet time to fix it.
The article mentions exactly that excuse:
> “If on-call engineers were to receive compensation for each incident they resolved, it would incentivize them to intentionally build systems that fail so they could increase their pay by increasing their on-call load.” My guy, that is sabotage and fraud. You are hypothesizing a scenario where your subordinates are committing actual crimes. If somebody is doing criminal acts at work, fire their ass! Not to mention that anybody who deliberately self-inflicts on-call load is a goddamn idiot and should be sacked just on that basis alone.
Deliberately breaking the system is different from taking your sweet time fixing an issue.
It's a matter of degree. Sabotage doesn't always mean instant breakage. In "The Simple Sabotage Field Manual" we learn to slow down the organization by things like "Bring up irrelevant issues as frequently as possible", "Work slowly. Think out ways to increase the number of movements necessary on your job.", and "Do your work poorly and blame it on bad tools, machinery, or equipment."
If you are the person who consistently takes 10x longer than everyone else to fix issues, then someone has a conversation with you. It's not that hard.
Good suggestion, and I can see the benefit for honest people, but unfortunately it’s equally a system that others can abuse financially, sometimes enough to stop people fixing things during their regular hours just so they can benefit at other times.
A good counter balance to this might be to offer even more compensation for no incidents, or otherwise well handled incidents that go on to squash types of that incident now and into the future.
If I’m rewarded for no incidents, doesn’t that mean I’m punished for there being incidents?
Not necessarily, but I agree that bad managers or organisations have a tendency to do this.
I guess the ultimate goal is to keep everyone happy right? Everyone has different ideals, you can probably assume the business wants everything to work by spending as little as possible, employees want to be paid as much as possible while enjoying their job. Striking that balance is always a challenge.
Some long haul truckers have a relatively workable pay scale. If it wasn’t for how far they expect you to drive in 24 hours, it might be considered good.
The general shape of it at least makes sense. You get paid when you’re on the road. More if you’re driving than if you’re parked, and more if you drive for more than your 40 hours a week.
This makes it in the employee's interest to obfuscate and extend any remediation, to get paid more.
It is also in my interest to skip spotting bugs during code review so I can look like a genius when I fix them when they cause issues in production. Of course I don't do that because this is incredibly stupid and I have better ways to spend my time.
10x pay is significant. Personally, I would remediate asap even with $1500/hour on the line. I do have ethical standards.
But would I be as motivated to stop the root cause of the recurring issues that keeps giving me an extra $3k every shift? Well, I probably would, but my coworkers already need to be goaded into fixing root causes, and they're not getting paid extra!
In general, I think conflicts of interest must be handled carefully, and ideally avoided. Paying 10x wages when incidents happen is a clear conflict of interest.
That’s true for any hourly paid job. Employers can choose to fire those who don’t work efficiently enough. What they can’t do is not pay people for hours worked, and with tech it’s easy for them to tell how long you are logged in for, to avoid them underpaying you.
Having overtime pay that is a multiple of the regular hourly rate is mandatory in many countries in Europe. Are you saying that European software tends to be more obfuscated? (answer: it is not).
Employees are also subject to the Working Time Directive in EU countries, which sets limits on the amount of overtime that is permitted in a week and in a month. Unfortunately in most countries it's full of loopholes.
One of the biggest problems with on-call rotations is that you're actually incentivized to do it poorly. Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do. You're never going to put your on-call work into your performance reviews; doing so might actively work against you. "I see that you spent time tuning alerts and updating the runbook. That's time you didn't spend on the actual tasks that were assigned to you."
If it's better to spend the least amount of time doing on-call work then the logical conclusion is that it's best to snooze as many alerts as possible until they either go away on their own or roll over past your rotation. Fixing the underlying problem might be worthwhile if it's something that you can fairly easily fix but if the on-call rotation is more than 2 people, the underlying problem is mathematically unlikely to be of your making and is it really a good idea to make a habit of fixing other people's broken code?
What's crazy is that I've never seen anyone with on call duties acting in this worst case bad faith manner. Companies basically abuse the work ethic of their employees because it's the cheapest possible way to check that box.
> Every minute you spend doing on-call work is time that you can't spend on the things you've actually been assigned to do.
In my experience, at least, if you're oncall during a sprint you would have less work assigned to you than otherwise (2 week sprint and 1 week you are oncall? 50% allocation), as the expectation is that you will spend that week responding to alerts, investigating issues, or even improving alerting and dashboards and fixing bugs. If this does not happen, devs don't push for it, and management is completely blind to it, you have an organization issue. If leadership does not care about the problem it's time to jump ship ASAP.
But I've seen people stubbornly defending an alert on >60% CPU usage of their Kubernetes pods, each allocated 1 CPU, where there was no impact on p99.9 latency (which was measured and was the actual metric that mattered, as agreed with the rest of the business and internal customers of the service). Or alerting on every single pod restart. That is self-inflicted pain.
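To illustrate paging on the metric that was actually agreed (p99.9 latency) instead of CPU, here is a minimal sketch. The nearest-rank percentile, the 200 ms threshold and the function names are my assumptions for the example, not anything from that team's real setup.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)
    rank = min(len(ordered), max(1, rank))  # clamp to a valid 1-based rank
    return ordered[rank - 1]


def should_page(latencies_ms, slo_ms, pct=99.9):
    """Page only when the agreed latency percentile breaches the threshold."""
    return percentile(latencies_ms, pct) > slo_ms


# CPU may be at 60% or more in both cases; only the latency tail decides.
healthy = [10.0] * 1000
degraded = [10.0] * 998 + [500.0] * 2
print(should_page(healthy, slo_ms=200.0))   # False
print(should_page(degraded, slo_ms=200.0))  # True
```

CPU usage never appears in the decision, so a busy but healthy pod wakes nobody up.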
It's just more hazing so the underclass of engineers don't realize they actually build everything big tech is selling and more.
That article made me shudder with echoes of having what we used to call “beeper madness” back in the 1990s. After a while of being on a roster of on call weeks, anything that beeped would make you reach for that pager on your belt.
As a kid the first few weeks were kind of exciting as it felt like you had been elevated to a new level of responsibility. Once that wore off it was obvious what a cage it was.
I don’t miss pagers.
I refused a contract to hire that was talking pager duty.
I saw how they freaked out about things outside the team’s control during business hours. The first time someone called me after 11 pm I was going to get myself fired talking to them.
" insure against every possible thing that could ever go wrong, they would have to build a second studio on a separate part of the city’s electric grid, with redundant copies of all the equipment and broadcast content, along with a full crew of understudies ready to take over at a moment’s notice."
WTH?? I guess this person has never heard of backup generators? Every broadcast TV station has them.
To begin with, airplanes do fall out of the sky, sometimes right on your backup generator. (speaking from military experience where yes, there is an entire backup studio waiting to take over, just in case. Or rather, the 'studio' is geographically dispersed, with 100% redundancy, which is another way of saying the same thing.)
But more importantly, *this* is what you noticed from that article?
No, but this was the point where the "let's make something up" got to be too much.
I'm not talking about low budget UHF channels, but TV stations I've been in and around all have multiple studios. If the switcher in Studio-A goes down, the signals can be routed to the control room for Studio-B. Also, Alex the know-it-all is such a forced thing that is just ridiculous and eye roll inducing. Anybody that is a jack of that many trades is a master of none of them. The entire forced analogy just got to be too much and I lost interest before a point was ever made.
That just covers electricity. They seem to be implying coverage of a multi-failure scenario.
Also, I don't think broadcast TV news is quite as reliable as the post makes it out to be.
Like this exchange happens all the time:
"and now we're going to our on site reporter, Onda Premises"
<45s silence>
"Oh, it appears we've lost them for now, we'll cycle back later. In our next story...."
Going to a remote is not the same thing as backup/redundancy within the studio. The broadcast was never interrupted. The latency between remotes can be mind-numbing, and with an inexperienced reporter and anchor stepping over each other, unable to sit through the delay, it's pain-inducing for the viewer. But that's an unrelated tangent.
PTSD. I was that guy dragging his backpack everywhere. A drink at a bar on Saturday? Sunday lunch with the in-laws? With my backpack, ready to bust out the laptop in case of emergency.
My two personal lows... I had to pull out the laptop at my own birthday party, in a restaurant. And at a funeral (not proud of that one).
So sad.
That seems like a very long-winded way to say you hate on-call, which is a completely normal thing to do. That said, is on-call effectively mandatory or very popular in the US startup world? Because here, in the European established company world, I can’t really recall seeing a job posting with on-call listed.
In Finland you usually get a static pay just for being on call, meaning you're X minutes away from a company-approved device you can use to fix things if they break.
If there's an actual alert you need to respond to, your pay goes to 2x hourly rate, or 3-4x if it's the weekend, and there is usually a minimum amount of billing if you have to do any work, usually 30-60 minutes. So if you get an alert and you fix it by pressing the "fix it" button on the dashboard on a Sunday, you just got paid hundreds of euros.
On the other hand you most likely saved the company from losing multiple times that in revenue, so it's worth it to the company.
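To make the arithmetic concrete, here is a minimal sketch of that kind of pay scheme. The rates, the flat standby fee, and the names `OnCallPolicy` and `incident_pay` are all made up for illustration; only the shape (a multiplier plus a minimum billed time) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class OnCallPolicy:
    # All numbers here are hypothetical, chosen only to illustrate the scheme.
    hourly_rate: float                 # base hourly rate in euros
    standby_flat: float                # flat fee just for carrying the pager
    weekday_multiplier: float = 2.0
    weekend_multiplier: float = 3.0
    minimum_billed_minutes: int = 60


def incident_pay(policy: OnCallPolicy, minutes_worked: int, weekend: bool) -> float:
    """Pay for one incident: multiplier applied to at least the minimum billed time."""
    billed = max(minutes_worked, policy.minimum_billed_minutes)
    multiplier = policy.weekend_multiplier if weekend else policy.weekday_multiplier
    return policy.hourly_rate * multiplier * billed / 60


policy = OnCallPolicy(hourly_rate=50.0, standby_flat=200.0)
# Five minutes of work on a Sunday is still billed as a full hour at 3x.
sunday_fix = incident_pay(policy, minutes_worked=5, weekend=True)
total = policy.standby_flat + sunday_fix
```

With these assumed numbers, the five-minute Sunday fix is billed as one hour at the weekend multiplier, on top of the standby fee.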
I have a bunch of friends who were single and/or child free in their 20s and have fully paid apartments/houses because they could be on call at any time because they didn't have any commitments.
I also know a good amount of incidents that were fixed in a pub corner table after a few drinks. I may or may not have contributed to that number. =)
Yes, basically every eng position paying what HN people expect will have the exact sort of described oncall.
I worked on OS frameworks at a FAANG and there was no on-call for anyone on the team.
The users of your flawless frameworks absorbed all on-call load.
There really is no point in on-call when writing client-side software.
I have yet to be on an oncall that is similar to what was described there, and I get paid pretty well. It's very much a function of where you work and what you work on.
Curious how this is the case. Do European companies not provide 24/7 services? Or staff a "follow the sun" model so no one has to answer pages outside of working hours? Do Europeans write better code so they don't need on-call?
I've worked at a few FAANGs in the US, and every single one after 2010 had an on-call rotation.
We totally have on-call, but we're also weaponizing German labor laws to force the company to have their shit together. There are a few interesting parts in there that cause quite the discomfort for employers:
The way contracts are worded, time working on on-call is work-time. Kinda obvious if you write it like that. As such, bad on-call weeks easily cut into the normal duties of the employee. This means team leads have an incentive to reduce time wasted on on-call.
You have mandatory rest-times. If an on-call activity takes an hour or so to fix, the person is suddenly not allowed to work for 10 hours due to these rest-time and maximum work time laws. Suddenly, "some little fixing at night" means the person isn't allowed to work the whole morning.
With a few rules like that, pages become really painful to the company. When a bad application kept pinging on-call every night for a few days, the entire normal work ground to a halt with people being unavailable and other team members dropping project work from sprints to pick up daily business slack. Some product managers got really pissed off and things in that product improved - I'm kinda curious what happened behind closed doors there.
It's actually 11h of mandatory downtime "between shifts"; this does indeed provide for theoretical opportunity to get good sleep for people with a short enough commute.
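A rough sketch of how that rest rule eats into the next working day, taking the 11-hour figure quoted here at face value (this is an illustration, not legal advice, and the function names are made up):

```python
from datetime import datetime, timedelta

# Figure quoted in the thread; treat it as an assumption, not a legal reference.
MANDATORY_REST = timedelta(hours=11)


def earliest_next_start(incident_end: datetime) -> datetime:
    """Earliest time the person may start working again after a night-time fix."""
    return incident_end + MANDATORY_REST


def lost_working_time(incident_end: datetime, normal_start: datetime) -> timedelta:
    """How much of the normal working day is lost to the mandatory rest period."""
    return max(timedelta(0), earliest_next_start(incident_end) - normal_start)


# A page resolved at 03:00 pushes the next permitted start to 14:00,
# so someone who normally starts at 09:00 loses five hours.
incident_end = datetime(2025, 1, 14, 3, 0)
normal_start = datetime(2025, 1, 14, 9, 0)
lost = lost_working_time(incident_end, normal_start)
```

One short page in the middle of the night costs most of the following morning, which is the lever being described.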
Nothing like that in Canada or, I suspect, the USA. You'd be expected to work the morning after.
I think it's mostly down to three factors:
1. Uncompensated oncall is legally tricky in many EU countries, so a lot of midsize companies look at the cost of paying for oncall, or sometimes just the time of administering paying for oncall, and decide they can do without. High frequency oncall is also often restricted (e.g. more than 1 week in 6 is not legal here)
2. A lot of the smaller companies are Europeans selling to Europeans, and are much more used to a business culture of availability during office hours. Especially there's a bigger share of B2B back office stuff in Europe compared to, say, restaurant POS systems.
3. Larger companies do seem more into follow the sun. A lot of the big tech in Europe are subsidiaries of US companies, so if they're in Europe it means they've already opened one remote location, and therefore are more likely to have another (California, Europe, India is a super common arrangement)
Never had on-call in 25 years with American startups. Surprised that FAANGs have on-call but, from a customer's perspective, offer no support, or limited support that may take days to respond.
Had some level of "on-call" in the sense that "there's a bug with your new/recent code and customer needs a fix" or "the demo isn't working and we present tomorrow" or "checkin deadline is in 2 weeks and my code isn't ready" pressure in 20+ years in silicon valley working at a startup and 2 medium size companies. The startup had a bit more since I knew lots of sysadmin stuff so I could help at times when our main IT guy wasn't available - but he was quite good and didn't let me have root access anyway, so it's not like I could have fixed it on those rare occasions.
Mind you, I interviewed for a Yahoo Mail job that would have included wearing a pager (back in '05). And I know it's pretty common - I've been fortunate to not have that be an issue. Hoping for another ~5 years at the current job until I can retire.
I've been on-call (sometimes on a rotation, sometimes always) for the past 10+ years. So, yes. It's common.
on-call is ubiquitous in the US tech industry. I've never had a job without it.
There's a great option for small companies that aren't amenable to on-call that is underappreciated. Hire an MSP. There are companies whose entire business model is having a geo distributed team with a stack of automated monitoring and run books for multiple clients. You train them, pay a set fee and never ask anyone on your team to be on call.
I’m quitting my job with nothing lined up because our oncall is such a piece of shit. House arrest every 5 weeks with no compensation. If it wasn’t for this I would just quiet quit but there’s no way I can make it through another shift without getting fired. Fuck oncall and anyone who has such little respect they think it’s ok
This article gave me unpleasant flashbacks to the first half of 2023. I resigned from planet.com in mid 2023 due to the stress caused by being on-call every second week. It took me six months to get my head into a healthy state again. Now I have a much better job, better paid and no possibility of on-call, ever.
The difference with dev oncall vs doctor on call is that it is self inflicted. Why are you getting paged? Because you built the system. Either your system isn't resilient enough, or you have noisy alerts. Both are problems you should be motivated to fix.
I have been on call 24/7/52 in SRE roles most of my career. It has either sucked hard, or not at all. And the time it sucked the most was because every single practice was bad. And now, I build better things because of it. Paying me more for on call wouldn't have changed how much it sucked. It wouldn't have made any material impact on my actual quality of life. But it would have done two things: 1) made me feel like I can't complain 2) given me less motivation to fix it
Paying for on call doesn't seem like a win. I want happy employees, not disgruntled but silent ones.
> Why are you getting paged? Because you built the system.
There are at least two problems with this thinking. The main problem is it's not generally true. The system is created by the entire organization. The people who raise money and allocate capital, the people who set development policies and priorities, the people who design and assemble the components, the people who sell it to customers and negotiate service levels and the people who operate and maintain it all collectively built the system.
Another problem is that it encourages moral hazards. Not paying fair on-call compensation allows unethical managers and sales staff to reap short-term rewards and bonuses by oversubscribing customers, promising more than can be delivered and rushing things to market before they're ready.
If you want happy employees, treat them fairly.
I guess what you are saying is the problem is the company culture - from a technical operations point of view at least - sucks. And no one wants to, or can, put the effort into fixing it.
I normally see people in oncall threads complaining that "I got paged by an alert because of another system X" - but at least in a big enough organization this should not happen, and it's an organizational failure. There should be a 24-hour operations center able to triage, escalate and evaluate, possibly not staffed only with L1 techs and given enough freedom to actually improve and automate. I know there are places where that is not true, and I ran away screaming from some in my career once I understood tech leadership had no understanding of why it was needed.
But you would be surprised how much of the oncall pain is actually self inflicted by application teams themselves (some examples I encountered in the last year: TCP connect timeouts in the minutes and with no retries, no retry policies in general and things that should be idempotent that are not, no circuit breaker strategies, connection pools churning as they're shared between 10+ remote endpoints, wrong expectations about transaction isolation levels and how to handle conflicts at least in simple scenarios).
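As one illustration of the "no retry policies" item, here is a minimal sketch of a bounded retry with exponential backoff and jitter. The function name and defaults are invented for the example, and it is only safe to wrap operations that are idempotent, which is exactly the other half of the complaint above.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.2, max_delay=5.0,
                      sleep=time.sleep):
    """Run operation(), retrying transient network errors with backoff.

    Only wrap idempotent operations: a retried call may run more than once.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so the delay can be swapped out in tests. Pairing this with a connect timeout of seconds rather than minutes is what keeps a dead dependency from turning into a page.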
> I guess what you are saying is the problem is the company culture ... sucks.
I believe the problem is the way devops is often practiced. I've worked as a developer, a manager and an operator and I've occasionally carried a pager. I think there is value in rotating between those roles at different times since it enables engineers to gain knowledge and insight they often won't get any other way. But assigning engineers to after-hours on-call duties when they're simultaneously responsible for product development "because they built the system" is just a stupid unethical and unsustainable practice that needs to end.
Good companies hire and train engineers to develop, manage and operate systems sustainably.
This works if you're on call for your systems. In many situations (ranging from small startups to big tech), you're also on-call for the systems of sister teams.
Not that there aren't other ways to fix that. But fixing the erroring service isn't practical in all cases.
24x7 on-call must not be a reality imho. I know it is the reality but it should not be propagated and we should not even begin to try and somehow normalise it.
Can't this be simpler? If your system needs to be working at night, and it pays (if it doesn't then what are we doing at night?), then you need to hire someone to look after it specifically at night (if possible from a geography where it is not night when it is others' nighttime.. and so on)? i.e. follow the sun.
I think paid oncall could work if oncall is voluntary. The more oncall sucks, the less likely team members are to volunteer because the pay isn't worth it, then the company/team needs to pay more as an incentive to get people to volunteer for oncall. Eventually the price is so high that it becomes cheaper to just build the system correctly and stop shoehorning features in with no regard for stability.
If oncall burden is light, then everyone volunteers because it is an easy way to make a bit of money.
However, it is a huge systemic change to move towards a voluntary model. Not sure how feasible this really is.
You will get PIPed for not signing up for being on call before they raise the on-call bonus.
"we need to talk about your on call volunteering"
"Really? I volunteered for 15 hours of on call"
"Well, 15 is the minimum ok? Now it's up to you if you just want to do the bare minimum, or uh..well look at Brian for example, he volunteered for 37 hours of on call"
Good point. That sounds about right.
>Why are you getting paged? Because you built the system.
I want to know which company you've worked at where that's even remotely true. There are always financial and time constraints that force you into trading off system resiliency for actually putting out a product.
We have integrations with dozens of external vendor APIs for which it's essentially impossible to disambiguate ahead of time whether any given error might be on their end or ours.
Yeah, I can relate to people saying they nearly got PTSD, I sure did get it. Paging apps use seriously offensive alarm sounds. I hated every sound they had in the options. It made me instantly sick. Fuck that!
The (in)offensiveness of the sound has little to do with it. For quite a few years after I left my job as an SRE at Google, this gentle little tune called "morning flower"[1] which I used as my pager alarm sound would give me a little jolt of adrenaline. Even now, it's a bit startling, but it's old enough now that at least I do not hear it in the wild anymore.
[1] https://www.youtube.com/watch?v=4hr8lhO3k5o
The right answer is, in my experience, something that starts soft and intermittent, and then gradually ramps up in loudness and annoyance. In bed, it's enough to have a harpsichord gently playing, but if I'm watching TV or out at a restaurant I might need it to eventually escalate to a klaxon to get my attention.
My first experience with this was Groovy Blue: https://youtu.be/WCbyp_UuyoM?si=s0ZpfcXZKw7dP17i
I want to armchair quarterback this:
> I suffered through a particularly acute week of on-call pain. At one point I was in my third or fourth video call about the same long and protracted smoldering SEV and, in a moment of frustrated weakness, I made an offhand comment about just being tired of repeatedly handling the same problem. [...] With the utterance of a single sentence, I opened a rift in the relationship with my manager that remained until the day I left that job.
It's possible that the manager dropped the ball in two ways:
1. Perma-haranguing an employee, when manager should instead have had a talk with employee about what was bothering the employee. (And split it, between the little that really needs to be said right now, while employee is already overextended, and what can happen after they recharge.)
2. The root cause of the recurring problem might be something management should've tackled already. Which is additional reason not to blame the employee, for calling out the problem in frustration.
I was on call in a 4 man startup for a 1 week rotation for about 9 months, 6 years ago. I still have an anxiety reaction when my phone rings. Can very much relate to the author's thoughts about PTSD.
This is an excellent article. I've commented before¹ why "on-call" as described is bad because it conflates roles and responsibilities and robs developers of resources, but this goes quite a bit further and explains why it's a bad practice that eventually leads to burnout.
With the exception of companies like PagerDuty, the sooner this practice is ended the better off we'll be.
¹ https://news.ycombinator.com/item?id=43400898
Jeez I guess what we do is industry standard best practice, and it sucks.
deferring to best practice instead of best judgement is a major plague of the software industry these days.
best practices usually come from giant companies with tens of thousands of engineers like google (who doesn't seem to be keeping up with competition btw) and amazon (which is notorious for burning out people).
what science or evidence drives the best practices?
> what science or evidence drives the best practices?
management "science"
(Former) iOS app developer here. I was oncall once and it was actually not that bad because every change had to go through app review, which put the lower bar on response times at days if not weeks. I hate app review but it was actually very nice that "oncall" really just meant "check the Slack channel in the morning" because there was no point in doing anything faster.
> My manager was present on the call, and my statement seemed to really set him off. I was essentially told that my feelings about the situation—perhaps the only authentic part of myself I ever expressed there—were wrong. In the days that followed I was made to feel like I was not a team player, that I was not pulling my weight, and that I was not meeting the bare minimum of what was expected of a person bearing the torch of on-call. With the utterance of a single sentence, I opened a rift in the relationship with my manager that remained until the day I left that job.
So, this is just plainly bad leadership, right? Totally believable too, of course, but just really bad. Bad for the employee, but also self-defeating for the manager.
It seems like this would be an awful manager reaction to anything short of a quasi-fireable offense. If that's your response to an employee not being enthusiastic about a part of work that sucks, what are you even doing as a leader?
Highly recommend listening to the 70s country gem that the title is an homage to: Take This Job And Shove It [1] by (the aptly named) Johnny Paycheck
[1] https://youtu.be/gj2iGAifSNI
In my country, if you are on an employment agreement, you are simply paid for overtime. My company never made any problems with that payment. And surprise...there were always people who chose to take the additional hours. I was often among them - the additional money/vacation was just too nice to resist.
This sounds like a healthy arrangement. Compensate fairly for overtime and those who want it will take it.
I had the same reaction as the author, the first time I heard the Kafka name, and then the same reaction when I heard the reasoning behind the name.
(Then I wondered when would be a sufficiently good occasion to say "Kafkaesque" regarding the software, since that's not something you want to waste.)
OP is right, but apart from that, this is one of the best written pieces I've read in an age. Agree, disagree, but it's so well written it's mesmerizing.
This article form might be cathartic for the writer, and actionable recommendations aren't the main point. But they are sprinkled throughout, and a small management slide deck could be distilled from the piece.
I recall "Soul of a New Machine" having parts that were fatiguing to read, incidentally (or intentionally) conveying the miserable mood of the characters, slogging through the project's trials.
Counterpoint: I'm at a small business and I'm primary for 24x7 oncall. I don't even take shifts with my coworker. But, this is because I'm empowered to make out of hours (overnight, weekend, holiday) calls STOP. I get woken up by something about once or twice a quarter.
When something wakes me up, the next day I start a process to ensure I don't get that same alert again: bugfixes, adjusting thresholds or time-to-critical, detecting problems and auto-remediation, determining it can be a "business hours" response.
This also requires buy-in from development. Literally yesterday I had an education opportunity with one of the developers about a ticket slated to go into production that evening that would have immediately eliminated one of our leading monitoring indicators, because it would have started creating hundreds or thousands of Sentry issues an hour. "I was thinking it was more like logging, where more information is better, where with monitoring we want the fewest messages possible."
Always, always, look at every pager hit and ask "what can prevent that from happening again?"
I was on as many as three on-call rotations for a few years. One had only two people for a while, so I was on every other week. The two things I most remember are:
* Arranging my whole life around on-call requirements. Bringing my laptop and backpack every time I went out. Designing new running routes that would use every street in a neighborhood and keep me close to home so I could respond within a 15-minute window. And yeah, the drinking thing. It pervaded my life in many ways I hadn't expected.
* Time zones and geography. These were always problematic, but especially during on-call. Often I'd narrow a problem down to a particular component that I didn't know well, so I'd try to contact the sub-team responsible for it, but nobody would respond. Then I'd try to turn the right knobs myself, and as often as not get yelled at for it in the morning. No, my afternoon, because my coworkers were three hours behind and late-commuters to boot. Of course they'd never hesitate to schedule meetings or ping me for trivial things well after my dinner.
I had taken the job, initially working on a project for which I was already a maintainer, because I wanted to avoid becoming an "architecture astronaut" by getting closer to operational reality. Indeed, I did learn a lot about how my own code behaved in real life. I don't have a problem with on-call requirements in and of themselves, but the way people and organizations handle the details is kind of <vomit emoji>.
being oncall forces the quality of software to improve.
if you want fewer incidents: ensure better QA, monitoring, smaller rollouts
usually developers start becoming more conservative after they do a few oncall shifts and suddenly prioritize important reliability improvements, instead of shiny new features nobody will use
Being on-call forces the *desire* for the quality of software to improve. Shitty management can and will override that. We don't have time for QA or to waste an engineer adding monitoring, we gotta ship ship ship.
Only a manager could have such a distorted view. I'd love to work on robustness but product management has 5 years worth of feature JIRAs lined up for me.
need to bake some refactoring time into regular tickets. PM should only care about features, while software devs should provide reliable estimates on the velocity of sustainable software development
Ah, but when Alex can do 4 tickets a week baking refactoring, maintenance, tests, and observability into their work, and Blake can do 8 tickets a week focusing only on features, who do you think is going to get promoted?
These incentives then quickly devolve into a classic prisoner's dilemma. There are huge incentives to "defect" by producing quick-but-dirty work. You get the benefit of looking like you're producing rapidly, but you've made the collective experience a little bit worse.
hm... if the team is agile, then everyone does refactoring and it is the team lead's job to assign tickets and evaluate. Team lead should have enough context to compare apples to apples.
if your work improving the codebase is not valued, then it's probably time to change jobs or just stop caring about code sustainability - let the business accrue technical debt, which is sometimes a viable strategy if your runway and planning horizon is limited
> being oncall forces the quality of software to improve.
Only when it's the managers that are on call.
are they usually not? is there no industry standard concept of an escalation manager?
I believe that too is as the author wrote – like a disheartening number of things in the tech industry, there are no real standards around what on-call responsibilities look like. Each organization is free to set things up in whichever way suits their tastes, and the resulting practices vary widely as a result.
Yes, makes perfect sense. I know when I want my horse to go faster, I don't entice it with more carrots, I just try to find better sticks to beat it with.
this doesn't always work. many things can go wrong in distributed systems and you cannot test for all of them. also you have no control of your dependencies like when AWS networking degrades or a 3rd party API provider changes their APIs without letting you know.
True, but these things happen very very rarely. Also: 1. Is there anything you can do about it? No? Remove the alert, replace with a "we are down sorry" message. Yes? Then automate that thing.
Rinse and repeat after every incident and you will eventually get paged rarely.
I think if you have a reasonable environment, where on-call feeds back to development (which is what OP is suggesting, more or less), you will absolutely get woken up for networking problems, because there's not really an alternative. Maybe some thresholding to allow for minor problems without alerting, but you know. If it's a big enough problem, someone has to fix it, and it doesn't matter if it's your problem or your dependency's problem, it breaks your service so it's your problem. If it happens a lot, you look for another network to run on.
For 3rd party APIs, if they're not critical, you start to develop kill switches. So yeah, someone has to wake up and handle it, but all they have to do is set the kill switch and go to sleep.
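A kill switch in this sense can be very small. The sketch below is an assumption about the shape, not a description of any particular system: `KillSwitch`, `fetch_recommendations`, and the "recommendations" feature are all hypothetical, and a real one would keep the flag somewhere shared rather than in process memory.

```python
class KillSwitch:
    """In-memory set of disabled features. A real one would live in shared config."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def fetch_recommendations(user_id, switches, call_vendor):
    """Non-critical feature backed by a third-party API (hypothetical example)."""
    if not switches.is_enabled("recommendations"):
        return []  # degrade gracefully instead of calling the broken vendor
    return call_vendor(user_id)
```

The person who gets paged calls `switches.disable("recommendations")` and goes back to bed; the proper fix waits for business hours.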
Personally, I did dev and on-call for SMS/Voice verification codes. Most of the time, that's in the nasty corner where it's super critical to the application (users can't use the product if they can't get a verification code) and it depends on 3rd parties that have three nines in a good year. In my case, I got tired of dealing with the disruption and developed automated routing that could manage most providers taking an outage without needing me to take action. Results could be better if a human took notice and action, but it was good enough most of the time, and partial outages were much easier for the automation to handle.
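For a sense of what such routing might look like, here is a toy version that ranks providers by recent success rate and falls through to the next one on failure. It is a guess at the general idea, not the system described above; `ProviderRouter` and the window size are invented for the example.

```python
from collections import deque


class ProviderRouter:
    """Prefer whichever provider has the best recent success rate."""

    def __init__(self, providers, window=20):
        # providers: dict of name -> callable(message) that raises on failure
        self.providers = providers
        self.history = {name: deque(maxlen=window) for name in providers}

    def success_rate(self, name):
        outcomes = self.history[name]
        # Providers with no data yet are treated optimistically.
        return sum(outcomes) / len(outcomes) if outcomes else 1.0

    def send(self, message):
        ranked = sorted(self.providers, key=self.success_rate, reverse=True)
        for name in ranked:
            try:
                self.providers[name](message)
            except Exception:
                self.history[name].append(0)
                continue  # try the next-best provider
            self.history[name].append(1)
            return name
        raise RuntimeError("all providers failed")
```

A provider that has only failed will sit at the bottom of the ranking until the others fail too, so a real version would also need to probe it occasionally to notice recovery.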
Even if there's no way to do something like that, at least automation can take care of 3rd party API is failing hard, so mostly return errors quickly without trying the API and only let a small fraction of requests go through to sample if the service came back online. That can keep your servers from getting overwhelmed, as well as drive the alert that helps you wake up, yell at the vendor, and decide if you can go back to sleep while the system takes care of itself.
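That fail-fast-but-keep-sampling behaviour is roughly a circuit breaker. A minimal sketch, with a made-up class name and thresholds picked arbitrarily:

```python
import random


class SamplingBreaker:
    """Fail fast while a dependency is down, but let a small fraction through."""

    def __init__(self, failure_threshold=5, sample_fraction=0.05, rng=random.random):
        self.failure_threshold = failure_threshold
        self.sample_fraction = sample_fraction
        self.rng = rng  # injectable so the sampling can be made deterministic
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, operation):
        # While open, reject most calls immediately without touching the API.
        if self.is_open and self.rng() >= self.sample_fraction:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success (including a sampled probe) closes the breaker again.
        self.consecutive_failures = 0
        return result
```

The count of fast failures is also a natural thing to alert on, since it tells whoever wakes up that the vendor is the problem and the servers are otherwise fine.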
When on-call is disconnected from development, that's when it gets really miserable. If you can swing a shift-work/follow the sun operator job, that's certainly better than on-call where incidents are common and there's no feedback loop to reduce things. It may well be better even if there is a feedback loop, but the feedback loop in that case requires explicit communication and effort; if I'm on call for my own work, I don't have to tell me to not push shit code right before I leave for the day or go to sleep, I'll get that message from myself right away. If someone else is cleaning up after my messes and doesn't communicate the effects to me, I might never know.
you are right, if your software is suffering regularly from the same issues - you go and fix them at the architecture level.
network issues? use a second DC, or some HA SDN setup.
3rd party API issues? Change vendor, or send stuff to a queue to reprocess later. All of these issues could and should be solved, and that's the job of the developer
There is another option. The on-call person just does a deliberately piss-poor job of resolving the problem. I mean, they resolve it but they make sure it takes an hour longer than necessary.
What are they going to do, fire you? If they make life hard for you, then get another job. The shoddier your work outside of your normal hours, the better. You can have quality, speed and cheapness, but you can only pick two.
I think there's also some middle ground where you don't go out of your way to carry a laptop but you do best effort while maintaining a normal outside-work life.
at a prev US tech job, a few years in they made all engineers do oncall, with no compensation added whatsoever, including on call for other teams' code in India.
i got paged a few times and didn't exactly rush to ack. not slow but didn't rush. the rotations increased more and more (mostly from other US employees quitting) till i quit. scummy company
TLDR; Author doesn't like being on-call when there's no concrete effort being made to solve the issue that results in calls.
The OP needs to write with some more focus. Most of this reads like a very long rant by someone who was woken up too many times recently.
> We need to talk about Kafka
No we don't, that entire section was irrelevant.
Well yeah, do you think people who are well rested and non stressed have oncall rotations they want to shove up someone’s nether regions?