
Fluentd buffer not pushing to OpenSearch after restart; experiencing data loss on fluentd restart #155

@Sameer0998

Description


Problem: We have two OpenSearch instances, one primary and one secondary, with different host URLs. Suppose the secondary becomes unreachable during the configuration phase; pushes are then temporarily stopped to both the primary and the secondary OpenSearch. During this time the buffer data file grows up to a certain point and then stops growing, even though chunk_limit_size is much higher.
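(For context, a minimal sketch of how I understand a two-destination pipeline could be wired; hostnames, the tag pattern, and paths are placeholders, and it assumes the fluent-plugin-opensearch output. With `@type copy`, each `<store>` gets its own buffer, so in principle an unreachable secondary should not stall pushes to the primary.)

```
<match moss.system.**>
  @type copy
  <store>
    @type opensearch
    host primary-os.example.com   # placeholder
    port 9200
    <buffer>
      @type file
      path /var/log/fluentd-buffers/primary.buffer
    </buffer>
  </store>
  <store>
    @type opensearch
    host secondary-os.example.com  # placeholder
    port 9200
    <buffer>
      @type file
      path /var/log/fluentd-buffers/secondary.buffer
    </buffer>
  </store>
</match>
```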

What's expected: no data loss when fluentd restarts.

[screenshot: buffer file sizes in the fluentd buffer directory]

As you can see in the picture, the buffer file size plateaus at around 2 MB and stops increasing, but in the fluentd buffer settings shown below chunk_limit_size is 64M and queued_chunks_limit_size is 100, yet I don't see any logs getting queued either.

        <buffer>
          @type file
          path /var/log/fluentd-edge-moss-pm-buffers/small.moss.system.buffer.pm
          flush_mode interval
          flush_thread_count 4
          flush_interval 60s
          retry_type periodic
          retry_max_times 20
          retry_wait 20s
          chunk_limit_size 64M
          queued_chunks_limit_size 100
          overflow_action throw_exception
        </buffer>

Now suppose we get the secondary up and running again. When fluentd restarts, it is expected to push the buffered data to OpenSearch, but instead the buffer file is removed and the data is never pushed.

So if the secondary is down for, say, an hour, the logs should stay queued and fluentd should retry the push once the connection is back up. Instead we are experiencing data loss. Why is this happening, or am I missing something here?

Also, these days whenever the fluentd pod restarts, we see certain data points missing or lost.
Is there any way to tune fluentd so that data loss is minimized?
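(One direction I have been considering, not a verified fix: trade throughput for durability in the buffer section. This sketch keeps my original path and assumes the standard fluentd buffer parameters; retry_forever stops chunks from being discarded when retries are exhausted, flush_at_shutdown drains the buffer on graceful restart, and total_limit_size caps total on-disk usage instead of relying on chunk size alone. The buffer path would also need to live on a persistent volume to survive pod recreation.)

```
<buffer>
  @type file
  path /var/log/fluentd-edge-moss-pm-buffers/small.moss.system.buffer.pm
  flush_mode interval
  flush_interval 60s
  flush_thread_count 4
  flush_at_shutdown true          # drain buffered chunks on graceful restart
  retry_type exponential_backoff
  retry_forever true              # never discard chunks on retry exhaustion
  retry_max_interval 300s         # cap the backoff interval
  chunk_limit_size 8M
  total_limit_size 2G             # cap total on-disk buffer instead
  overflow_action block           # apply backpressure rather than raising
</buffer>
```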

Also, suppose the flush_interval of 60 seconds elapses while one of the instances is unreachable: the chunk is flushed anyway, so that data is lost, right?
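(If I understand retry_type periodic correctly, retries are evenly spaced at retry_wait, so the settings above give only a short retry window before a chunk is given up on, which might explain the loss during a one-hour outage. A rough back-of-the-envelope check:)

```python
# Retry-window arithmetic for the buffer settings shown above.
# With retry_type periodic, fluentd retries every retry_wait seconds,
# up to retry_max_times, then gives up on the chunk.
retry_wait = 20       # seconds, from the config above
retry_max_times = 20  # from the config above

max_retry_window = retry_wait * retry_max_times  # total seconds of retrying
print(max_retry_window)        # 400 seconds
print(max_retry_window / 60)   # roughly 6.7 minutes, far less than a 1 h outage
```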
