Tempesta & backend servers health statistics #1454

@krizhanovsky

Linked with #726

Scope

Backends health status

Currently Tempesta reports the number of non-2xx responses from the backend servers only if health checking is switched on.

For troubleshooting it is extremely useful to know how many non-2xx responses were generated by each backend server as well as by Tempesta FW itself (e.g. @i-rinat recently observed many error responses from Tempesta in automated tests, probably due to #940).

The configuration syntax is TBD, but I propose introducing a new configuration option

health_stat 400 5*;

to monitor 400 and all 5xx error responses produced by Tempesta FW. The statistics for each response code matching the list must appear in /proc/tempesta/perfstat.
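The matching rule implied by `400 5*` can be sketched as follows. This is only an illustration, not the proposed implementation: the directive syntax is still TBD, and the function names `parse_codes`, `matches` and `count_matching` are hypothetical. The sketch assumes a pattern is either a full three-digit code or one or two leading digits followed by a single `*`.

```python
from collections import Counter

def parse_codes(spec):
    """Split a directive argument string such as '400 5*;' into patterns."""
    patterns = spec.strip().rstrip(";").split()
    for p in patterns:
        digits = p.rstrip("*")
        stars = len(p) - len(digits)
        # Assumption: '400' (exact code) or '5*' / '50*' (prefix wildcard).
        ok_exact = stars == 0 and len(digits) == 3
        ok_wild = stars == 1 and 1 <= len(digits) <= 2
        if not digits.isdigit() or not (ok_exact or ok_wild):
            raise ValueError("bad status code pattern: %r" % p)
    return patterns

def matches(status, patterns):
    """Return True if the numeric status matches any pattern."""
    s = str(status)
    for p in patterns:
        if p.endswith("*"):
            if s.startswith(p[:-1]):
                return True
        elif s == p:
            return True
    return False

def count_matching(statuses, patterns):
    """Count responses per status code, keeping only the monitored codes."""
    return Counter(s for s in statuses if matches(s, patterns))

patterns = parse_codes("400 5*;")
stats = count_matching([200, 400, 404, 502, 502, 503], patterns)
```

With the sample input, only 400, 502 and 503 are counted; 200 and 404 are ignored because they match neither the exact code nor the `5*` wildcard.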

There should also be a global and/or per-backend configuration option

health_stat_server 400 5*;

which provides similar statistics in the per-server procfs file. If health monitoring is enabled and uses the same error codes, the same responses must not be counted twice.

I'm placing this task in the 0.9 milestone, like #940, because the monitoring will be very useful for debugging and testing that problem.

Backend performance statistics

Per-vhost cache misses/hits. We agreed at the meeting that we cannot account cache hits per server (upstream) and should only count 200 responses per server. We will probably implement per-vhost performance statistics, but those will cover more than cache counters, so a separate feature request is needed for them.

Tempesta TLS connection errors

https://www.ssllabs.com/ssltest/analyze.html?d=tempesta-tech.com reports some TLS issues, e.g. for iOS 6. However, when I checked with a real device from the list of failing clients, I did not find any issues.

We need to account TLS connection errors for better observability.

(we have #1914 for TLS traceability)

Tempesta FW performance statistics

At the moment we gather response time percentiles for each backend server, but not for Tempesta FW itself. We need to provide the same response time statistics for Tempesta FW as we provide for backend servers.

We also need to provide the average, 90th percentile and maximum duration of client TCP connections.
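To make the three requested connection metrics concrete, here is a small offline sketch. It assumes "90%" means the 90th percentile and uses the nearest-rank definition over a list of per-connection durations in milliseconds. The function name `duration_summary` is hypothetical, and an in-kernel implementation would more likely use a histogram than sort raw samples.

```python
def duration_summary(durations_ms):
    """Return average, 90th percentile (nearest rank) and maximum duration."""
    if not durations_ms:
        return None
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest-rank percentile: 1-based rank = ceil(0.9 * n),
    # computed in integers to avoid floating-point rounding.
    rank = (9 * n + 9) // 10
    return {
        "avg": sum(ordered) / n,
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }

summary = duration_summary([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000])
```

In this sample the single long-lived connection dominates the average and the maximum, while the 90th percentile stays at 90 ms, which is why reporting all three values is useful.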

Other issues

Negative values in statistics

On current master, as of the date of this issue, I observe negative values in the statistics when only a couple of requests have been processed by Tempesta (observed for the first line only):

Minimal response time		: -1ms
Average response time		: 0ms
Median  response time		: 0ms
Maximum response time		: 0ms
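A regression check for this could parse the statistics output and flag any negative value. The sketch below assumes the `name : value` line layout shown above. The helper names `parse_perfstat` and `negative_values` are hypothetical and not part of any existing test framework.

```python
import re

# Matches lines such as "Minimal response time\t\t: -1ms".
LINE_RE = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>-?\d+)\s*[a-zA-Z]*\s*$")

def parse_perfstat(text):
    """Map each statistic name to its integer value (unit suffix dropped)."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stats[m.group("name")] = int(m.group("value"))
    return stats

def negative_values(stats):
    """Return only the statistics whose value is below zero."""
    return {name: value for name, value in stats.items() if value < 0}

SAMPLE = (
    "Minimal response time\t\t: -1ms\n"
    "Average response time\t\t: 0ms\n"
    "Median  response time\t\t: 0ms\n"
    "Maximum response time\t\t: 0ms\n"
)

bad = negative_values(parse_perfstat(SAMPLE))
```

For the output quoted in this issue, only the minimal response time is flagged.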

Hung socket buffers

The same number of socket buffers in flight may be reported for a relatively long time, which looks fishy (statistics from our web site):

# for i in `seq 1 10`; do grep 'Socket buffers in flight' perfstat ; sleep 1; done
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49
Socket buffers in flight		: 49

Testing

  • Check the error statistics for Tempesta using wildcard and full error code matchers
  • Check the error statistics for per-backend servers using wildcard and full error code matchers
  • Check the error statistics for default backend servers using wildcard and full error code matchers
  • Check overlapping statistics with health monitoring
