During the coronavirus pandemic, it has been hard for teams to come together at a workplace and make decisions, because circumstances keep changing.
To help bring some clarity to the situation, Michael Fisher, Group Product Manager at OpsRamp, has created a list of tips for IT Ops and DevOps staff who are handling incident resolution during these trying times.
Discover his best tips for managing the health and availability of your online enterprise below.
Coordinate across departments.
Identifying the most critical problems to solve right now is the hardest step and the one that will take the longest. Common problems these days relate to cost and scale: How do I right-size my environment to reduce costs or handle more workloads? Broaden your field of inputs, then narrow down to a specific issue, such as slow page loads on the financial reporting website. Product managers, account managers and business unit leads, who are closest to the customer experience, can deliver feedback on the top issues affecting customer and user satisfaction. Teams should also review recent support tickets to identify recurring pain points.
Fine-tune your approach to metrics.
The Utilization, Saturation and Errors (USE) Method is one way to approach the problem-first incident management process. As detailed by Brendan Gregg, a senior performance architect at Netflix, this methodology begins by posing questions, seeking answers, and working backward to the metrics. For each resource that you want to measure, identify three metrics: one for utilization, one for saturation, and one for errors. “The USE Method has made you aware of what you didn’t check: what were once unknown-unknowns are now known-unknowns,” Gregg explains.
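As a rough illustration of the three-metrics-per-resource idea, here is a minimal Python sketch. The `UseCheck` class, its field names, and the 80% utilization threshold are assumptions made for this example, not part of Gregg's published method or any monitoring tool. It records whichever metrics you have for a resource, flags the ones that look unhealthy, and lists the ones you have not checked yet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UseCheck:
    """One resource and its three USE metrics (hypothetical structure)."""
    resource: str
    utilization: Optional[float] = None  # fraction of time busy, 0.0 to 1.0
    saturation: Optional[float] = None   # amount of queued work, e.g. queue length
    errors: Optional[int] = None         # count of error events

    def unchecked(self) -> List[str]:
        # Metrics nobody has looked at yet: the "known-unknowns".
        names = ("utilization", "saturation", "errors")
        return [n for n in names if getattr(self, n) is None]

    def findings(self, util_threshold: float = 0.8) -> List[str]:
        # The threshold is an assumed default; tune it per resource.
        found = []
        if self.utilization is not None and self.utilization >= util_threshold:
            found.append("high utilization")
        if self.saturation is not None and self.saturation > 0:
            found.append("saturation")
        if self.errors is not None and self.errors > 0:
            found.append("errors")
        return found


checks = [
    UseCheck("cpu", utilization=0.93, saturation=4.0, errors=0),
    UseCheck("disk", utilization=0.40),  # saturation and errors not measured yet
]

for check in checks:
    print(check.resource, check.findings(), "unchecked:", check.unchecked())
```

The `unchecked` list is the useful part during an incident: it makes visible which resources still have gaps in coverage before anyone draws conclusions.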
Create a common process for incident management across all your teams. Without the advantage of having almost everyone in the same room to huddle in an ad hoc fashion when big issues crop up, it’s imperative to institute clear steps and roles. Doing so will prevent the frustration, confusion and oversights that needlessly delay resolution. Since most incidents have multiple contributing factors, teams need to adopt a small number of user-friendly tools to document and organize the information.
At our company, we’re now using tools like Miro, an online whiteboard application, to replace our physical whiteboarding sessions. Of course, there’s also Zoom, Slack, Jira and a host of other cloud-based tools already in place at many organizations. Mandate which tools everyone should use, with some guidelines on how to use them.
In some organizations, scaling requirements in response to demand have increased tenfold. Automation is playing a critical role now; moving away from a web GUI, for example, is more scalable and aligns with modern tools like Chef and Puppet. User tickets can be autogenerated, for instance, from emails and linked to code management systems like GitHub. Modern development and operations teams are also expanding automation in unit testing and provisioning.
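To make the email-to-ticket idea concrete, here is a minimal sketch using only Python's standard `email` package. The `ticket_from_email` function and the ticket fields (`title`, `reporter`, `description`, `labels`) are hypothetical names chosen for this example. Sending the result to a tracker such as GitHub or Jira is deliberately left out, since that step depends on the tools your organization uses.

```python
from email import message_from_string
from email.policy import default


def ticket_from_email(raw: str) -> dict:
    """Turn a raw email into a ticket-shaped dict (hypothetical schema)."""
    msg = message_from_string(raw, policy=default)

    # Prefer the plain-text part; fall back to an empty description.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""

    return {
        "title": str(msg["Subject"] or "(no subject)"),
        "reporter": str(msg["From"] or "unknown"),
        "description": text,
        "labels": ["auto-generated"],  # assumed label for triage
    }


raw_email = (
    "From: alice@example.com\n"
    "Subject: Slow page loads on reporting site\n"
    "\n"
    "Reports take 30 seconds to load.\n"
)

ticket = ticket_from_email(raw_email)
print(ticket)
```

Keeping the parsing step separate from the submission step means the same function can feed whichever ticketing or code management system the team has standardized on.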
Watch for burnout.
Whether because there’s more work or because of a need to fill the hours during long quarantine days, many software engineers, testers and architects are working longer days right now. Yet exhaustion and burnout can lead to errors and oversights, along with low morale. It’s up to managers to make sure that employees are taking breaks, working reasonable hours and keeping enough time and energy to attend to personal needs.