[Hacker News repost] How to manage oncall as an engineering manager?
Title: How to manage oncall as an engineering manager?
Text: As a relatively new engineering manager, I oversee a team that handles a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.

The challenge I'm currently facing is ensuring that our on-call engineers have sufficient time to focus on system improvements, particularly enhancing operational experience (Opex). Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

I am looking for a framework that will allow me to:

Clearly define on-call priorities, balancing immediate production needs with Opex improvements.
Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers.
Create a structured approach that ensures ongoing focus on improving operational experience over time.
Url:
Post by: frugal10
Comments:
cbanek: I've been on a lot of oncall lists... 4-5 per week seems extremely high to me. Have you gathered up and classified what the issues were? Are there any patterns or areas of the code that seem to be problematic? Are you actually fixing issues and getting to their root cause, or are they getting worse? It sounds like you don't know the answer because you don't really understand the problem.

If you don't have enough time to run the system and you also have to do new feature work, one has to give way to the other, or you have to hire additional people (but this rarely solves the problem; if anything, it tends to make it worse for a while until the new person finds their bearings).

One way that is very simple but not easy is to let the on-call engineer do no feature work for the period they are on call and work only on on-call issues and on investigating and fixing them. If there isn't anything on fire, let them improve the system. This helps with things like comp time ("worked all night on the issue, now I have to show up all day tomorrow too???") and lets people actually fix issues rather than just restart services. It also gives the on-call person agency to help fix the problems, rather than just deal with them.
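A minimal sketch of the "gather up and classify" step from the comment above, assuming each incident has already been labeled with a component and a category. The incident records, component names, and the `recurring` helper are all hypothetical and only illustrate the idea of counting repeats to surface patterns.

```python
from collections import Counter

# Hypothetical incident log: (component, category) pairs, invented for illustration.
incidents = [
    ("payments-api", "timeout"),
    ("payments-api", "timeout"),
    ("ingest-worker", "disk-full"),
    ("payments-api", "bad-deploy"),
    ("ingest-worker", "disk-full"),
]


def recurring(incidents, min_count=2):
    """Return ((component, category), count) pairs seen at least min_count times."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


patterns = recurring(incidents)
for (component, category), n in patterns:
    print(f"{component}: {category} x{n}")
```

Pairs that show up repeatedly are candidates for a root-cause fix; one-off incidents fall below the threshold and stay out of the list.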
gobins: A few things that worked for us:

1. The roster is set weekly. You need at least 4-5 engineers so that nobody is rostered more than once per month. Any more often than that and your engineers will burn out.

2. There is always a primary and a secondary. The secondary gets called when the primary cannot be reached.

3. You are expected to triage the issues that come in during your on-call week, but not to work on long-term fixes. Those are something you bring to the team discussion to be allocated. No one wants to do too much maintenance work.

4. Your top priorities should be the issues that come up repeatedly and burn your productivity. This could take up to a year. Once things settle down, your engineers should be free enough to work on things that they are interested in.

5. For any cross-team collaboration that takes more than a day, the manager should be the point of contact so that your engineers don't get shoulder-tapped and pulled away from what they are working on.

Hope this helps.
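The weekly primary/secondary roster described above could be generated roughly as follows. This is a sketch under assumptions: the engineer names and the `build_roster` function are hypothetical, and placing the secondary half a rotation away from the primary is one possible design choice, not something the comment prescribes.

```python
def build_roster(engineers, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Each engineer is primary once every len(engineers) weeks. The secondary is
    taken half a rotation away so the two duties do not fall in adjacent weeks
    when the team is large enough.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    offset = n // 2  # always >= 1 because n >= 2
    roster = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + offset) % n]
        roster.append((week + 1, primary, secondary))
    return roster


# Hypothetical five-person team: each person is primary once every five weeks.
team = ["ana", "ben", "chi", "dev", "eli"]
for week, primary, secondary in build_roster(team, 5):
    print(f"week {week}: primary={primary}, secondary={secondary}")
```

With five engineers each person is primary once every five weeks, which satisfies the "not more than once per month" guideline for the primary slot; secondary duty adds a second, lighter week per cycle.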
seniortaco: 4-5 issues per week can be a lot or a little, depending on the severity of the issues. Most of them are likely recurring issues that your team sees a few times a month, where the root cause hasn't been addressed and needs to be.

Driving down oncall load is all about working smarter, not necessarily harder. 30% of the issues likely need to be fixed by another team. These need to be identified ASAP and handed off so that the other team can work on them in parallel while your team focuses on the issues you "own".

Set up a weekly rotation for issue triage and mitigation. The engineer oncall should respond to issues, prioritize based on severity, mitigate impact, and create and track Root Cause issues to fix the root cause. These should go into an operational backlog. This is 1 full-time headcount on your team (but rotated).

To address the operational backlog, you need to build role expectations with your entire team. It helps if leadership is involved. Everyone needs to understand that in terms of career progression and performance evaluation, operational excellence is one of several role requirements. With these expectations clearly set, review progress with your directs in recurring 1-1s to ensure they are picking up and addressing operational excellence work, driving down the backlog.
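The triage flow in this comment (hand off what another team owns, order the rest by severity) can be sketched as below. The `Issue` fields, the severity scale (1 = most severe), the team names, and the `triage` function are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    severity: int    # assumed scale: 1 is most severe
    owner_team: str  # team that should fix the root cause


def triage(issues, my_team):
    """Split issues into (owned, handoff); owned issues are sorted most severe first."""
    owned = sorted(
        (i for i in issues if i.owner_team == my_team),
        key=lambda i: i.severity,
    )
    handoff = [i for i in issues if i.owner_team != my_team]
    return owned, handoff


# Hypothetical week of on-call issues.
week = [
    Issue("queue backlog growing", 3, "platform"),
    Issue("checkout 500s", 1, "payments"),
    Issue("stale cache entries", 2, "payments"),
]
owned, handoff = triage(week, my_team="payments")
print([i.title for i in owned])
print([i.title for i in handoff])
```

The `owned` list is what would feed the team's operational backlog; the `handoff` list is what gets routed to other teams as early as possible.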
tthflssy: Without knowing your context, it is hard to give advice that is ready to be applied. As a manager, you will need to collect and produce data about what is really happening and what the root cause is.

First clear up what the charter of your team is: what should be in your team's ownership? Do you have to do everything you are doing today? Can you say no to production feature development for some time? Who do you need to convince: your team, your manager or the whole company?

Figure out how to measure and assign value to opex improvements, e.g. you will have only 1-2 on-call issues per week instead of 4-5, which is a saving in engineering time and is measurable in reliability (SLA/SLO, as mentioned in another comment). Then you will understand how much time is worth spending on those fixes and which opex ideas are worth pursuing.

Improve the efficiency of your team: are they making the right decisions and taking the right initiatives / tickets?

Argue for headcount and you will have more bandwidth after some time. Or split 2 people off to work only on opex improvements, and administratively give priority to these initiatives (if the rest of the team can handle on-call).
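The "assign value to opex improvements" idea can be made concrete with some simple arithmetic. In this sketch the issue counts use the midpoints of the ranges mentioned above (4.5 and 1.5 per week); the 3 hours per issue and the 90-hour cost of the fix are invented numbers, and both function names are hypothetical.

```python
def weekly_hours_saved(issues_before, issues_after, hours_per_issue):
    """Engineering hours saved per week by reducing the on-call issue count."""
    return (issues_before - issues_after) * hours_per_issue


def payback_weeks(fix_cost_hours, saved_per_week):
    """Weeks until the time invested in a fix is recovered."""
    if saved_per_week <= 0:
        raise ValueError("the fix must save a positive number of hours per week")
    return fix_cost_hours / saved_per_week


# Assumed inputs: 4.5 -> 1.5 issues per week, 3 hours per issue, 90 hours to build the fix.
saved = weekly_hours_saved(4.5, 1.5, 3.0)
weeks = payback_weeks(90.0, saved)
print(f"saves {saved} h/week, pays back in {weeks} weeks")
```

Comparing payback periods across candidate fixes gives a rough ordering of which opex ideas are worth pursuing first; it ignores reliability benefits that do not show up as engineer hours.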
matt_s: Think of on-call like medical triage. On-call should triage outage-level scenarios (partial or full) and respond to alerts, take immediate action to remedy the situation (restart services, scale up, etc.), and then create follow-on tickets to address root causes. Those tickets go into the pool of work that the entire team works from. It is like an ER team stabilizing a patient and identifying next steps, or sending the patient off to a different team that takes the time to solve their longer-term issue.

The team needs to collectively work on project work and on the opex work coming from on-call. On-call should be a rotation through the team. Runbooks should be created on how to deal with scenarios and iterated on to keep them up to date.

Project work and opex work are related. If you have a team dealing with on-call that is separate from the team doing project work, then there isn't a sense of ownership of the product, since it's like throwing things over a wall for another team to clean up the mess.