Why and how to use site reliability golden signals


Engineers use SRE metrics to benchmark and improve the reliability and performance of systems and services. Learn more about the 4 golden signals (latency, errors, traffic, saturation). By @cortex.io.

Growing software complexity makes it harder for teams to identify and resolve issues quickly. IT service management has evolved from an afterthought into a central part of DevOps. In microservices architectures, such issues are especially prone to being identified late or missed altogether.

In this article you will learn:

  • What is site reliability engineering (SRE)?
  • The core components of site reliability engineering
  • What are SRE metrics and why are they important?
  • What are the four golden signals of SREs?
    • Latency
    • Traffic
    • Errors
    • Saturation
  • Best practices for measuring and improving SRE metrics
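To make the four golden signals concrete, here is a minimal sketch that computes them over one time window of request records. Everything in it is a hypothetical illustration and not taken from the article:

- the `Request` record and its fields;
- the choice of HTTP 5xx as "error";
- the nearest-rank p99 as the latency figure;
- busy/total workers as the saturation measure.

Real systems usually derive these from a metrics backend rather than raw lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one served request.
    duration_ms: float
    status_code: int


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, busy_workers, total_workers):
    """Compute latency, traffic, errors and saturation for one window.

    Assumptions (illustrative only): 5xx responses count as errors,
    latency is the p99 of successful requests, and saturation is the
    fraction of worker slots in use.
    """
    success_latencies = [r.duration_ms for r in requests if r.status_code < 500]
    error_count = sum(1 for r in requests if r.status_code >= 500)
    return {
        # Latency: tail latency of successful requests only.
        "latency_p99_ms": percentile(success_latencies, 99) if success_latencies else None,
        # Traffic: demand on the system, here requests per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed.
        "error_rate": error_count / len(requests) if requests else 0.0,
        # Saturation: how "full" the service is.
        "saturation": busy_workers / total_workers,
    }
```

Latency is measured on successful requests only. A fast failure, such as an immediate 503, would otherwise make latency look better than users actually experience it. Google's SRE book recommends tracking the latency of failed requests separately for the same reason.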

Your priorities will change, and your metrics should evolve with them. One year, for example, your main concern might be incident management and getting your team to resolve incidents quickly. In that case, you may want to track mean time to recovery (MTTR) and latency. Interesting read!
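Mean time to recovery is simply the average length of incidents. Below is a minimal sketch, assuming each incident is recorded as a hypothetical `(started_at, recovered_at)` pair of datetimes. Teams define the start and end points differently (detection vs. onset, mitigation vs. full resolution), so treat these boundaries as an assumption.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered_at - started_at) over (start, end) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the durations, starting from a zero timedelta.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)
```

For instance, one 30-minute incident and one 90-minute incident give an MTTR of one hour.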


Tags devops app-development performance teams