edzhelyov/open-monitoring

open-monitoring

A small repository to keep organized information about common monitoring and performance metrics.

Logging

Ruby stdlib: https://docs.ruby-lang.org/en/master/Logger.html

Theory

Field definition

APM client libraries

Metrics

These are the 3 most widely used formats for metrics.

  • StatsD
  • Prometheus/OpenMetrics
  • OpenTelemetry
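As a rough illustration of how two of these formats look on the wire, the sketch below renders one counter as a StatsD line and as a Prometheus text exposition line. The helper names `statsd_line` and `prometheus_line` and the metric names are hypothetical, label values are not escaped, and OpenTelemetry is left out because its protocol (OTLP) is not a simple text line.

```python
def statsd_line(name, value, metric_type="c"):
    """Render a StatsD datagram line: <name>:<value>|<type> ("c" = counter)."""
    return f"{name}:{value}|{metric_type}"


def prometheus_line(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format.

    Labels are sorted by key so the output is deterministic.
    Assumption: label values contain no characters that need escaping.
    """
    labels = labels or {}
    if not labels:
        return f"{name} {value}"
    rendered = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    print(statsd_line("http.requests", 1))
    print(prometheus_line("http_requests_total", 42, {"method": "GET"}))
```

The StatsD line carries a single increment, while the Prometheus line carries the current cumulative value together with its dimensions.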

Popular open source backends

Theory

  • RED, USE, Four signals: https://grafana.com/files/grafanacon_eu_2018/Tom_Wilkie_GrafanaCon_EU_2018.pdf
  • Dimensions per user and per endpoint. Per-user dimensions show what the user experience is (high cardinality); per-endpoint dimensions show how each endpoint behaves (low cardinality).
  • Low-scale: 1–300 requests per minute (~0.02–5 RPS)
  • Medium-scale: 300–10,000 requests per minute (~5–166 RPS)
  • High-scale: 10,000+ requests per minute (~166+ RPS)
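The scale tiers above can be sketched as a small classifier, with the conversion from requests per minute to requests per second spelled out. The function names are hypothetical, and the handling of the boundary values is an assumption: the listed ranges overlap at 300 and 10,000, so here the higher tier wins.

```python
def requests_per_second(rpm: float) -> float:
    """Convert requests per minute to requests per second."""
    return rpm / 60.0


def scale_tier(rpm: float) -> str:
    """Classify traffic by requests per minute.

    Assumption: exactly 300 counts as medium and exactly 10,000 as high.
    """
    if rpm < 300:
        return "low"
    if rpm < 10_000:
        return "medium"
    return "high"


if __name__ == "__main__":
    for rpm in (60, 300, 4_500, 10_000):
        print(rpm, scale_tier(rpm), round(requests_per_second(rpm), 2))
```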

Collecting signals

In theory, everything is an event. A log is one specific kind of event, and a span is an event with a duration, or equivalently a pair of start and end events. Metrics are aggregations over events.

Events that are so small that it is not reasonable to collect them individually are often gathered by periodic sampling in the form of metrics: CPU load, memory usage, etc. For the rest of the events, metrics can be calculated from the raw data. This approach is available in the OTel Collector, where you can configure which metrics to calculate from the incoming data. It is not very common.
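To make the idea of calculating metrics from raw events concrete, here is a minimal sketch that aggregates span-like events into per-endpoint metrics. The `Span` type, its fields, and `derive_metrics` are hypothetical names for illustration only; this is not how the OTel Collector is implemented or configured.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """An event with a duration: a start and an end timestamp in seconds."""
    endpoint: str
    start: float
    end: float
    error: bool = False

    @property
    def duration(self) -> float:
        return self.end - self.start


def derive_metrics(spans):
    """Aggregate raw spans into request count, error count and latency per endpoint."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.endpoint].append(span)

    metrics = {}
    for endpoint, items in grouped.items():
        durations = [span.duration for span in items]
        metrics[endpoint] = {
            "request_count": len(items),
            "error_count": sum(1 for span in items if span.error),
            "avg_duration": sum(durations) / len(durations),
            "max_duration": max(durations),
        }
    return metrics


if __name__ == "__main__":
    spans = [
        Span("/users", 0.0, 0.5),
        Span("/users", 1.0, 2.0, error=True),
        Span("/health", 3.0, 3.25),
    ]
    for endpoint, values in derive_metrics(spans).items():
        print(endpoint, values)
```

Because the raw spans are kept, other aggregations such as percentiles or per-user breakdowns could be added later without changing what is collected.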

Signals, as defined by OpenTelemetry, are mainly used for introspection: what state a system was in and what happened in it. This requires a UI that can visualize the data as tables and graphs and that lets you search and group it.

Dashboards

For different types of signals there are well-established sets of default dashboards that are really good to have automatically. Think of CPU, free disk, I/O, etc., and of number of requests, latency, and throughput. If your UI supports such automatic dashboards, it makes your life much easier, because you don't have to work out what is worth visualizing in the first place.

Alarms

The other important aspect of collecting observability data is getting notified when problems occur. For this, it's better if you get predefined alarms for the most common infrastructure issues.

Databases

When you have a database a few important things to consider are:

  • Having backups. A daily backup with 7 days of retention is a good starting point.
  • What is your data loss window in case of a crash? It is the time between disk syncs and/or replica syncs.
  • Do you have replication?
  • Alarms for running out of disk space, and for memory and CPU utilization above 90%.
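The 90% alarm rule from the list above could be checked along these lines. This is only a sketch: `check_alarms` and the resource names are hypothetical, utilization is assumed to be expressed as a fraction between 0 and 1, and "above 90%" is read as strictly greater than the threshold.

```python
ALARM_THRESHOLD = 0.90  # fraction of capacity, i.e. 90%


def check_alarms(utilization, threshold=ALARM_THRESHOLD):
    """Return the sorted names of resources whose utilization exceeds the threshold.

    `utilization` maps a resource name to a fraction between 0.0 and 1.0.
    """
    return sorted(name for name, used in utilization.items() if used > threshold)


if __name__ == "__main__":
    readings = {"disk": 0.95, "memory": 0.50, "cpu": 0.91}
    for resource in check_alarms(readings):
        print(f"ALARM: {resource} utilization above {ALARM_THRESHOLD:.0%}")
```

The readings would come from whatever collects your infrastructure metrics; for disk usage, Python's standard `shutil.disk_usage` is one possible source.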
