Human factor: how companies can avoid cloud disasters

October 19, 2024 Editorial Staff


Large businesses work hard to ensure that their services do not go down. The reason is simple: significant outages can hurt your brand, driving customers to other products.

Building a reliable internet service is a difficult technical challenge, but it is also a leadership challenge. It can be hard to motivate your engineering team to invest in reliability because it is often perceived as less exciting than creating new features. Top tech companies have thousands of employees and operate hundreds of internet services. Over the years, they have developed clever ways to ensure that their engineers build reliable systems. This article discusses the human engineering techniques used by some of the most successful tech firms in history.

Spin the wheel

The AWS Operational Review is a meeting that is open to everyone in the company. At each meeting, a “wheel of fortune” randomly selects an AWS service for live review. The team under review must answer questions from experienced operational leaders about its dashboards and metrics. Attendees include hundreds of employees, dozens of directors and several vice presidents.

This encourages every team to maintain a minimum level of operational competency. No manager or tech lead wants to appear clueless in front of the entire company if the wheel lands on their service, so it is essential to review your reliability metrics regularly. Leaders who take an active interest in the operational health of their organization set the tone for all employees, and the wheel is one way to achieve this.

What should you do during these operational reviews, though? The next step is to define measurable reliability goals. Live interactions such as chat have a lower tolerance for latency than asynchronous workloads such as training a machine-learning model or uploading a movie. Your goals should reflect the concerns of your customers.

When you review a team’s metrics, have the team describe its measurable reliability goals. Make sure that you, and they, understand why these goals were selected, and use dashboards to show that the goals are being met. Setting measurable goals lets you prioritize reliability work in a data-driven manner.

It is also important to focus on detection. If you notice anomalies in a team’s dashboards, ask whether the team can explain them and whether the on-call engineer was notified. You should be able to detect problems before your customers do.
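To make "measurable reliability goals" concrete, here is a minimal Python sketch of how a team might compare a service's recent numbers against its goals before a review. Everything in it is an assumption made for illustration: the `ReliabilityGoal` fields, the choice of 95th-percentile latency, the idea of an "error budget" (the share of allowed failures already consumed), and the sample figures. It is not a description of how AWS or any other company runs its reviews.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityGoal:
    """A measurable goal for one service (hypothetical fields)."""
    service: str
    availability_target: float  # e.g. 0.999 means 99.9% of requests succeed
    p95_latency_target_ms: float  # ceiling for 95th-percentile latency


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def review_service(goal, total_requests, failed_requests, latencies_ms):
    """Compare observed numbers with the goal and summarize the result."""
    availability = 1 - failed_requests / total_requests
    allowed_failures = (1 - goal.availability_target) * total_requests
    if allowed_failures > 0:
        budget_used = failed_requests / allowed_failures
    else:
        # A 100% target leaves no room for failure at all.
        budget_used = float("inf") if failed_requests else 0.0
    p95 = percentile(latencies_ms, 95)
    availability_met = availability >= goal.availability_target
    latency_met = p95 <= goal.p95_latency_target_ms
    return {
        "service": goal.service,
        "availability": availability,
        "availability_met": availability_met,
        "error_budget_used": budget_used,
        "p95_latency_ms": p95,
        "latency_met": latency_met,
        "goals_met": availability_met and latency_met,
    }


if __name__ == "__main__":
    # A latency-sensitive chat service gets a tighter latency goal than a
    # batch workload would (sample numbers are made up).
    chat_goal = ReliabilityGoal("chat", 0.999, 300.0)
    report = review_service(chat_goal, 100_000, 40, [120, 80, 200, 150, 950])
    for key, value in report.items():
        print(f"{key}: {value}")
```

A report like this gives a review something specific to discuss: which goal was missed, by how much, and whether the on-call engineer heard about it before customers did.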

Embrace chaos

One revolutionary shift in the cloud resilience mindset is to inject failure into production. Netflix formalized this concept as “chaos engineering”, and the idea is as cool as the name suggests.
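As a rough illustration of failure injection, the Python sketch below wraps a call to a dependency so that a configurable fraction of calls fail on purpose, and pairs it with a caller that degrades gracefully instead of erroring. All names (`with_failure_injection`, `fetch_recommendations`, the fallback list) are hypothetical, and this is not Netflix's tooling. Real chaos experiments typically act on infrastructure, for example by terminating instances, rather than on a single function.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_failure_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service (hypothetical)."""
    return [f"personalized-title-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id):
    """Fault-tolerant caller: serve a generic list if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        # A real caller would also handle timeouts and connection errors.
        return ["popular-title"]


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so repeated runs behave the same way
    flaky_fetch = with_failure_injection(fetch_recommendations, 0.3, rng)
    results = [recommendations_with_fallback(flaky_fetch, u) for u in range(1000)]
    degraded = sum(r == ["popular-title"] for r in results)
    print(f"{degraded} of {len(results)} requests were served the fallback")
```

The point of the exercise is the caller, not the wrapper: if failures arrive routinely, code that lacks a fallback path is exposed quickly.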

Netflix wanted to incentivize its engineers to build fault-tolerant systems without resorting to micromanagement. Its engineers reasoned that if systemic failure became the norm rather than the exception, teams would have no choice but to build fault-tolerant systems. It took Netflix a long time to achieve this goal. Each service should be able to absorb injected failures without affecting availability.

This strategy is costly and complex. If you are shipping a product that requires a high level of uptime, however, failure injection is a great way to get something close to a “correctness proof”. If your product requires this, introduce it as soon as you can. It has never been easier or more affordable than it is today.

If chaos engineering is too much, you can at least require your teams to run simulated outages (game days) every year or two, or before any major feature launch. A game day involves three roles: one person simulates the outage, a second fixes it without knowing what went wrong, and a third observes and takes detailed notes. Afterward, the team should hold a joint post-mortem of the incident (see below).

Have a thorough post-mortem procedure

A post-mortem is a good indicator of a company’s culture. The top tech companies all require their teams to produce post-mortems after major outages. The report should explain the incident, examine its root causes and identify prevention measures. The post-mortem process should be strict and adhere to high standards, but it should not single out any individuals as the culprits. The purpose of a post-mortem is to correct, not punish. There are often underlying problems that led to an engineer making a mistake. You may need to improve your testing or put better guardrails on critical systems. Fix those systemic problems. An entire article could be written on designing a robust post-mortem process, but it is safe to say that having one will help prevent the next outage.

Reward reliability work

If engineers believe that promotions and raises come only from shipping new features, they will put reliability on the back burner. Most engineers, regardless of seniority, should contribute to operational excellence. In your performance evaluations, reward improvements in reliability. Hold your senior-most engineers accountable for the stability of the systems they oversee.

While this recommendation may seem obvious, it is surprisingly easy to miss.

Conclusion

In this article, we explored some fundamental tools that embed reliability into your company culture. Early-stage companies and startups do not usually prioritize reliability. That is understandable: a fledgling business must focus obsessively on proving product-market fit in order to survive. Once you have a loyal customer base, however, your future depends on maintaining its trust. People earn trust by being dependable. It is the same for internet services.

Aditya Visweswaran is a senior software engineer on Google Cloud’s security platform team.

