Wed. May 22, 2024

How To Reduce MTTR

DZone

As a Site Reliability Engineer, one of the key metrics I use to track the effectiveness of incident management is Mean Time To Recover (MTTR). Wikipedia defines MTTR as the average time that a service or system takes to recover from a failure. Achieving a low MTTR is key to meeting the service level objectives and, in turn, the service level agreements of any critical production service. 10 Things That Can Help Reduce the Mean Time to Recovery (MTTR) 1.

Latency 130
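
The MTTR definition quoted in the teaser above (the average time to recover from a failure) can be illustrated with a small calculation. This is only a sketch: the `mean_time_to_recover` helper and the incident timestamps are hypothetical and do not come from the DZone article.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Return the average of (recovered - failed) over all incidents.

    `incidents` is a list of (failed_at, recovered_at) datetime pairs.
    Hypothetical helper, written for illustration only.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered - failed for failed, recovered in incidents),
        timedelta(),  # start value so that sum() adds timedeltas
    )
    return total / len(incidents)


# Made-up incident log: recovery took 30, 90 and 60 minutes.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 30)),
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 15, 30)),
    (datetime(2024, 5, 15, 9, 0), datetime(2024, 5, 15, 10, 0)),
]

mttr = mean_time_to_recover(incidents)
print(mttr)  # 1:00:00
```

With these sample values the three recoveries total 180 minutes, so the mean is one hour.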

Are Your Backups Able to Pass the Unisuper Test?

Percona

You may have heard about Unisuper’s problem last week. The $135 billion company’s data was wiped off its cloud, and everything disappeared: zip, zero, nada. Everything they relied upon was gone. Their Google Cloud experience has been labeled an “unprecedented occurrence.”

Google 87

Presentation: From Mainframes to Microservices - the Journey of Building and Running Software

InfoQ

Suhail Patel discusses the platforms and software patterns that made microservices popular, and how virtual machines and containers have influenced how software is built and run at scale today.