open-monitoring

A small repository to keep organized information about common monitoring and performance metrics.

Logging

Ruby stdlib: https://docs.ruby-lang.org/en/master/Logger.html
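
As a minimal sketch of structured logging with the stdlib Logger, the snippet below swaps in a formatter that emits one JSON object per line. The helper name `build_json_logger` and the field names (`message`, `path`, `duration_ms`) are assumptions made for illustration, not a field definition from this repository.

```ruby
require "logger"
require "json"
require "stringio"
require "time"

# Build a Logger that writes one JSON object per line to the given IO.
# (Hypothetical helper; only Logger, JSON and Time come from the stdlib.)
def build_json_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, progname, msg|
    # Accept either a plain string or a Hash of structured fields.
    fields = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    base = { time: time.utc.iso8601(3), level: severity, progname: progname }
    JSON.generate(base.merge(fields)) + "\n"
  end
  logger
end

out = StringIO.new
log = build_json_logger(out)
log.info({ message: "request finished", path: "/health", duration_ms: 12 })
puts out.string
```

Passing a Hash keeps the fields machine-readable, so a backend can index them without parsing free text.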

Theory

Field definition

APM client libraries

AppSignal
Sentry
OpenTelemetry
Datadog
New Relic
[Amazon X-Ray](https://aws.amazon.com/xray/)

Metrics

These are the 3 most widely used formats for metrics.

StatsD
Prometheus/OpenMetrics
OpenTelemetry
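
To make the difference between the first two formats concrete, here is a rough sketch that renders a metric as a StatsD line and as a Prometheus text exposition line. The function names and metric names are hypothetical, and label values are not escaped. OpenTelemetry is left out because it ships metrics over OTLP rather than as a single text line.

```ruby
# StatsD line: <name>:<value>|<type>
# where type is "c" (counter), "g" (gauge) or "ms" (timer).
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Prometheus text exposition line: <name>{<label>="<value>",...} <value>
# Assumption: label values contain no quotes, backslashes or newlines.
def prometheus_line(name, value, labels = {})
  return "#{name} #{value}" if labels.empty?

  pairs = labels.map { |key, val| %(#{key}="#{val}") }.join(",")
  "#{name}{#{pairs}} #{value}"
end

puts statsd_line("http.requests", 1, "c")
puts prometheus_line("http_requests_total", 1027, { method: "GET", status: "200" })
```

StatsD pushes each increment to a daemon, while a Prometheus server scrapes the current totals, which is why the first line carries a delta and the second a running count.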

Popular open source backends

Netdata (https://github.com/netdata/netdata): automatic alerts, GPL-licensed agent. Cloud: free community tier, or $4.5/node/month for unlimited.
Uptrace, with very good documentation: https://uptrace.dev/tools/api-monitoring-tools

Theory

RED, USE, and the Four Golden Signals: https://grafana.com/files/grafanacon_eu_2018/Tom_Wilkie_GrafanaCon_EU_2018.pdf
Dimensions: per user and per endpoint. Per-user data shows what the user experience is (high cardinality); per-endpoint data shows how each endpoint behaves (low cardinality).
Low-scale: 1–300 requests per minute (~0.02–5 RPS)
Medium-scale: 300–10,000 requests per minute (~5–167 RPS)
High-scale: 10,000+ requests per minute (~167+ RPS)
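
The scale tiers above come down to dividing requests per minute by 60. A small sketch of that arithmetic follows; the function names are hypothetical, and assigning the boundary values (300 and 10,000) to the higher tier is an assumption, since the ranges above overlap at those points.

```ruby
# Convert requests per minute to requests per second.
def rps(requests_per_minute)
  requests_per_minute / 60.0
end

# Classify a load level using the tiers listed above.
# Assumption: exactly 300 counts as medium and exactly 10,000 as high.
def scale_for(requests_per_minute)
  if requests_per_minute < 300
    :low
  elsif requests_per_minute < 10_000
    :medium
  else
    :high
  end
end

[1, 300, 10_000].each do |rpm|
  puts "#{rpm} req/min = #{rps(rpm).round(2)} RPS (#{scale_for(rpm)})"
end
```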

Collecting signals

In theory, everything is an event. A log is one specific kind of event, and a span is an event with a duration, or equivalently a pair of start and end events. Metrics are aggregations over events.

Events that are so small that it is not reasonable to collect them individually, such as CPU load or memory usage, are usually gathered by periodic sampling and reported as metrics. For the rest of the events, metrics can be calculated from the raw data. The OpenTelemetry Collector takes this approach: you configure which metrics to calculate from the incoming data. This approach is not very common.
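
To illustrate calculating metrics from raw events, here is a minimal sketch that models a span as an event with start and end timestamps and aggregates spans into per-endpoint metrics. The `Span` struct, its fields, and the `aggregate` function are assumptions made for this example; it only mirrors the idea and is not how the OpenTelemetry Collector is configured.

```ruby
# A span modelled as an event with a start and an end timestamp (in seconds).
Span = Struct.new(:endpoint, :started_at, :ended_at, :error) do
  def duration
    ended_at - started_at
  end
end

# Aggregate raw spans into per-endpoint request count, error count
# and mean duration.
def aggregate(spans)
  spans.group_by(&:endpoint).transform_values do |group|
    {
      requests: group.size,
      errors: group.count(&:error),
      mean_duration: group.sum(&:duration) / group.size
    }
  end
end

spans = [
  Span.new("/users", 0.0, 0.25, false),
  Span.new("/users", 1.0, 1.75, true),
  Span.new("/health", 2.0, 2.5, false)
]

metrics = aggregate(spans)
metrics.each { |endpoint, values| puts "#{endpoint}: #{values}" }
```

Keeping the raw spans means new aggregations can be added later, at the cost of storing and processing every event.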

Signals, as defined by OpenTelemetry, are mainly used for introspection: what state a system was in and what happened in it. This requires a UI that can visualize the data as tables and graphs, and that can search and group it.

Dashboards

For each type of signal there is a well-established set of default dashboards that is really good to have automatically. Think of CPU, free disk, and I/O, as well as request count, latency, and throughput. If your UI supports such automatic dashboards, it makes your life much easier, because you don't have to work out what is worth visualizing in the first place.

Alarms

The other important aspect of collecting observability data is getting notified when problems occur. For this, it's better to have predefined alarms for the most common infrastructure issues.

Databases

When you have a database a few important things to consider are:

Backups: a daily backup with 7 days of retention is a good starting point.
What is your data loss window in case of a crash? It is the time between disk syncs and/or replica syncs.
Do you have replication?
Alarms for running out of disk and for memory and CPU utilization above 90%.
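
The 90% alarm rule from the list above can be sketched as a simple threshold check. The readings are made-up sample values expressed as fractions, and applying the same threshold to disk, memory, and CPU is an assumption; a real setup would read these from the monitoring backend.

```ruby
# Utilization above this fraction raises an alarm (90%, as in the list above).
THRESHOLD = 0.9

# Return the names of resources whose utilization exceeds the threshold.
def alarms(readings, threshold = THRESHOLD)
  readings.select { |_resource, used| used > threshold }.keys
end

# Hypothetical readings: fraction of each resource currently in use.
readings = { disk: 0.93, memory: 0.71, cpu: 0.95 }

firing = alarms(readings)
puts "Alarms firing for: #{firing.join(', ')}"
```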
