Description
Linked with #726
Scope
Backends health status
Currently Tempesta reports the number of non-2xx responses from backend servers only if health checking is switched on.
For troubleshooting it is extremely useful to know how many non-2xx responses were generated by each backend server as well as by Tempesta FW itself (e.g. @i-rinat recently observed many error responses from Tempesta in automated tests, probably due to #940).
The configuration is TBD, but I propose to introduce a new configuration option
health_stat 400 5*;
to monitor 400 and all 5xx error responses produced by Tempesta FW. The statistics for each response code matching the list must appear in /proc/tempesta/perfstat.
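As a rough illustration of the proposed matching semantics, the sketch below parses a matcher list such as `400 5*` and counts only the matching response codes. It is Python rather than Tempesta's kernel C, and it assumes that a matcher is either a full three-digit code or a single class digit followed by `*`; the function names are hypothetical and the final syntax is still TBD.

```python
from collections import Counter


def parse_matchers(spec):
    """Split a directive argument such as "400 5*" into validated matchers.

    Assumption: only "NNN" (full code) and "N*" (whole class) are allowed.
    """
    matchers = []
    for token in spec.split():
        is_full = len(token) == 3 and token.isdigit()
        is_class = len(token) == 2 and token[0].isdigit() and token[1] == "*"
        if not (is_full or is_class):
            raise ValueError(f"unsupported status matcher: {token!r}")
        matchers.append(token)
    return matchers


def matches(status, matchers):
    """Return True if a numeric status code matches any matcher."""
    code = str(status)
    if len(code) != 3:
        return False
    for matcher in matchers:
        if matcher.endswith("*"):
            # Class wildcard: compare the leading digit only.
            if code[0] == matcher[0]:
                return True
        elif code == matcher:
            return True
    return False


def count_responses(statuses, matchers):
    """Count each matching status code separately."""
    stats = Counter()
    for status in statuses:
        if matches(status, matchers):
            stats[status] += 1
    return stats
```

With `parse_matchers("400 5*")`, a stream of 200, 400, 404, 500, 502, 502 yields one entry for 400, one for 500 and two for 502, while 200 and 404 are ignored.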
There should also be a global and/or per-backend configuration option
health_stat_server 400 5*;
which shows similar statistics in the per-server procfs file. If health monitoring is enabled and uses the same error codes, the statistics must not be duplicated.
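One possible way to avoid duplicated entries is to keep a single counter table per server and track the union of the codes requested by health monitoring and by `health_stat_server`. The Python sketch below only illustrates that idea; the class name, the explicit code sets and the output format are assumptions, not the actual Tempesta design.

```python
from collections import Counter


class ServerStats:
    """Per-server response counters shared by both features (hypothetical)."""

    def __init__(self, monitor_codes, stat_codes):
        # Codes listed by both features are stored once thanks to the union.
        self.tracked = set(monitor_codes) | set(stat_codes)
        self.counts = Counter()

    def account(self, status):
        """Account one response; untracked codes are ignored."""
        if status in self.tracked:
            self.counts[status] += 1

    def report(self):
        """Render one line per tracked code that was seen, sorted by code."""
        return [f"{code} : {n}" for code, n in sorted(self.counts.items())]
```

If health monitoring tracks 500 and 502 and `health_stat_server` tracks 400 and 500, code 500 appears in the report exactly once.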
I place the task for 0.9, just like #940, because the monitoring will be very useful in debugging and testing that problem.
Backend performance statistics
Per-vhost cache misses/hits. We agreed at the meeting that we cannot account cache hits per server (upstream) and should only count 200 responses per server. We will probably implement per-vhost performance statistics, but they will include more than cache counters, so a separate feature request is needed for them.
Tempesta TLS connection errors
https://www.ssllabs.com/ssltest/analyze.html?d=tempesta-tech.com reports some TLS issues, e.g. for iOS 6. However, when I checked with a real device from the faulty list, I didn't find any issues.
We need to account TLS connection errors for better observability.
(we have #1914 for TLS traceability)
Tempesta FW performance statistics
At the moment we gather response time percentiles for each backend server, but not for Tempesta FW itself. We need to provide the same response time statistics for Tempesta FW as we provide for backend servers.
We also need to provide the average, 90th percentile and maximum duration of client TCP connections.
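To make the requested connection metrics concrete, here is a minimal Python sketch that summarizes a list of connection durations. It assumes that "90%" means the 90th percentile and uses the nearest-rank definition; the function name and the millisecond unit are illustrative only, and a kernel implementation would more likely use a histogram than a sorted list.

```python
def connection_duration_summary(durations_ms):
    """Return average, nearest-rank 90th percentile and maximum duration."""
    if not durations_ms:
        raise ValueError("no connection durations recorded")
    ordered = sorted(durations_ms)
    n = len(ordered)
    # Nearest rank: ceil(0.9 * n), computed in integers to avoid float error.
    rank = (9 * n + 9) // 10
    return {
        "avg": sum(ordered) / n,
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }
```

For ten connections lasting 10, 20, ..., 100 ms this gives an average of 55 ms, a 90th percentile of 90 ms and a maximum of 100 ms.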
Other issues
Negative values in statistics
On the current master, as of the date of this issue, I observe negative values in the statistics when only a couple of requests have been processed by Tempesta (observed for the first line only):
Minimal response time : -1ms
Average response time : 0ms
Median response time : 0ms
Maximum response time : 0ms
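A regression test for this could scan the statistics output for negative timings. The Python sketch below assumes the `name : <value>ms` line format shown above; the helper name is hypothetical.

```python
import re

# Matches lines such as "Minimal response time : -1ms".
TIMING_RE = re.compile(r"^\s*(?P<name>.+?)\s*:\s*(?P<value>-?\d+)ms\s*$")


def negative_timings(text):
    """Return (name, value) pairs for every timing line with a negative value."""
    bad = []
    for line in text.splitlines():
        m = TIMING_RE.match(line)
        if m and int(m.group("value")) < 0:
            bad.append((m.group("name"), int(m.group("value"))))
    return bad
```

Applied to the four lines above, it reports only `("Minimal response time", -1)`.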
Hung socket buffers
The same number of socket buffers may appear in the statistics for a relatively long time, which looks fishy (statistics from our web site):
# for i in `seq 1 10`; do grep 'Socket buffers in flight' perfstat ; sleep 1; done
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
Socket buffers in flight : 49
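The manual check above could be automated along these lines. This Python sketch parses sampled `Socket buffers in flight` lines and flags a counter that stays at the same non-zero value across all samples. The threshold of 10 samples and the rule that a constant zero is not suspicious are assumptions made for illustration.

```python
def in_flight_values(lines):
    """Extract the 'Socket buffers in flight' values from perfstat lines."""
    values = []
    for line in lines:
        name, sep, value = line.partition(":")
        if sep and name.strip() == "Socket buffers in flight":
            values.append(int(value))
    return values


def looks_stuck(values, min_samples=10):
    """True if enough samples were taken and all share one non-zero value."""
    return (
        len(values) >= min_samples
        and values[0] != 0
        and len(set(values)) == 1
    )
```

Ten identical samples of 49, as in the output above, are flagged as stuck; a series that changes between samples is not.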
Testing
- Check the error statistics for Tempesta using wildcard and full error code matchers
- Check the error statistics for per-backend servers using wildcard and full error code matchers
- Check the error statistics for default backend servers using wildcard and full error code matchers
- Check overlapping statistics with health monitoring