Description
Problem: We have two OpenSearch instances, one primary and one secondary, with different host URLs. If the secondary is unreachable during the configuration phase, pushing is temporarily stopped to both the primary and the secondary OpenSearch. During this time the buffer data file grows up to a certain point and then stops growing, even though chunk_limit_size is much higher.
What's expected: No data loss when Fluentd restarts.
As you can see in the picture, the log file size stops increasing at around 2 MB, but based on the Fluentd buffer settings shown below, chunk_limit_size is 64M and queued_chunks_limit_size is 100, yet I don't see any chunks getting queued either.
<buffer>
  @type file
  path /var/log/fluentd-edge-moss-pm-buffers/small.moss.system.buffer.pm
  flush_mode interval
  flush_thread_count 4
  flush_interval 60s
  retry_type periodic
  retry_max_times 20
  retry_wait 20s
  chunk_limit_size 64M
  queued_chunks_limit_size 100
  overflow_action throw_exception
</buffer>
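To check whether chunks are actually being queued or are being dropped, one option is Fluentd's built-in monitor_agent input. A minimal sketch (the bind address and port below are assumptions; 24220 is just the conventional default):

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

With this in place, the plugin metrics exposed at http://localhost:24220/api/plugins.json include buffer_queue_length and buffer_total_queued_size for each output, which should show whether chunks are piling up in the queue or disappearing.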
Now let's say we get the secondary up and running again. When Fluentd restarts, it is expected to push the buffered data to OpenSearch, but instead the buffer file is removed and the data is not pushed to OpenSearch.
So if the secondary is down for, say, an hour, the logs should be queued and the push should be retried once the connection is back up, but instead we are experiencing data loss. Why is this happening, or am I missing something here?
Also, lately, whenever the Fluentd pod restarts, we see certain data points missing or lost.
Also, is there any way to tune Fluentd so that data loss is minimized?
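For reference, a minimal sketch of buffer settings commonly used to reduce loss, assuming the same file buffer path as above; the sizes and intervals here are placeholders, not recommendations:

<buffer>
  @type file
  path /var/log/fluentd-edge-moss-pm-buffers/small.moss.system.buffer.pm
  flush_mode interval
  flush_interval 60s
  flush_thread_count 4
  # flush remaining chunks when Fluentd shuts down
  flush_at_shutdown true
  # keep retrying instead of discarding chunks after retry_max_times
  retry_type exponential_backoff
  retry_forever true
  retry_max_interval 300s
  chunk_limit_size 64M
  # cap total on-disk buffer usage
  total_limit_size 8G
  # apply backpressure instead of raising an exception when the buffer is full
  overflow_action block
</buffer>

Note that flush_at_shutdown defaults to false for file buffers, and with retry_forever the output keeps retrying rather than discarding a chunk once retry_max_times is exceeded.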
Also, let's say the flush_interval of 60 seconds is reached and one of the instances is not reachable: the chunk is flushed anyway, so that data is lost, right?
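One more note on the retry settings shown above: with retry_type periodic, retry_wait 20s and retry_max_times 20, retries are exhausted after roughly 20 × 20 s ≈ 400 s, after which the chunk is discarded unless a fallback output is configured. A sketch of Fluentd's <secondary> directive (not to be confused with the secondary OpenSearch instance discussed above) that writes exhausted chunks to local files; the match pattern, output type, and /var/log/fluentd-failed directory are placeholders for whatever this deployment actually uses:

<match **>
  @type opensearch
  # ... primary output settings ...
  <buffer>
    # ... buffer settings as above ...
  </buffer>
  # fallback: once retries are exhausted, write chunks to local files
  # instead of discarding them
  <secondary>
    @type secondary_file
    directory /var/log/fluentd-failed
    basename dump.${chunk_id}
  </secondary>
</match>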