Skip to content

Fluentd on K8s stopped flushing logs to Elastic #732

@nanorobocop

Description

@nanorobocop

(check apply)

Problem

We have Fluentd running in Daemonset (using fluentd-kubernetes-daemonset).
At some point almost all instances of Fluentd stop flushing their queue.

screenshot

I checked related issues like #525 and rancher/rancher#25819 and #600, but no luck.

Elastic is behind load balancer, so we set following setting:

      reload_connections false
      reconnect_on_error true
      reload_on_failure false

No errors or warning messages appears in fluentd logs.

I checked lsof: buffer and queue files are opened by ruby process:

root@fluentd-5m67p:/home/fluent# ps auxf
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       6811  0.0  0.0   9744  2528 ?        Ss   08:53   0:00 bash
root       6818  0.0  0.0  13244  1784 ?        R+   08:53   0:00  \_ ps auxf
root          1  0.0  0.0   8288   168 ?        Ss   Mar12   0:07 tini -- /fluentd/entrypoint.sh
root          7  0.0  0.0 147120 59736 ?        Sl   Mar12   1:34 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluentd.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd
root         19  0.2  0.1 391980 92724 ?        Sl   Mar12  16:08  \_ /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluentd.conf -p /fluentd/plu
root@fluentd-5m67p:/home/fluent# lsof | grep /var/log/fluentd | head
ruby      19                root    6r      REG              253,2      702 260204793 /var/log/fluentd-es.buffer/buffer.q5a0ed281c815ce1d996b701324ef2f34.log
ruby      19                root   16u      REG              253,2      415   1041535 /var/log/fluentd-rancher-rke.pos
ruby      19                root   20r      REG              253,2  3725271 260126777 /var/log/fluentd-es.buffer/buffer.q5a0ed2a2be44d342cc4ace0849b87681.log
ruby      19                root   21r      REG              253,2      185 260204768 /var/log/fluentd-es.buffer/buffer.q5a0ed2a2be44d342cc4ace0849b87681.log.meta
ruby      19                root   22u      REG              253,2     4956 260204774 /var/log/fluentd-es.buffer/buffer.b5a0ed2a2be7851b8fed18bb5745d3821.log
ruby      19                root   24u      REG              253,2     1581   1041536 /var/log/fluentd-k8s.pos
ruby      19                root   29r      REG              253,2     9154 260204777 /var/log/fluentd-es.buffer/buffer.q5a0ed260efbd77103296f04691540f76.log
ruby      19                root   30r      REG              253,2      188 260204782 /var/log/fluentd-es.buffer/buffer.q5a0ed260efbd77103296f04691540f76.log.meta
ruby      19                root   32r      REG              253,2    18877 260204786 /var/log/fluentd-es.buffer/buffer.q5a0ed263ccf18519cdc1081e7bdf2d8b.log
ruby      19                root   33r      REG              253,2      183 260204787 /var/log/fluentd-es.buffer/buffer.q5a0ed263ccf18519cdc1081e7bdf2d8b.log.meta

Growing queue (there're much more files on other instances):

root@fluentd-5m67p:/home/fluent# ls -laht /var/log/fluentd-es.buffer/
total 22M
-rw-r--r--   1 root root 1.8M Mar 16 09:11 buffer.b5a0f3a0610eeb354b96a33d8f58bb903.log
-rw-r--r--   1 root root  185 Mar 16 09:11 buffer.b5a0f3a0610eeb354b96a33d8f58bb903.log.meta
-rw-r--r--   1 root root 841K Mar 16 09:11 buffer.b5a0ed2a3b19cc6bd31a6def5ad27d677.log
-rw-r--r--   1 root root  190 Mar 16 09:11 buffer.b5a0ed2a3b19cc6bd31a6def5ad27d677.log.meta
-rw-r--r--   1 root root 8.4M Mar 16 09:11 buffer.b5a0ed281c85b0bff32ec4dd6f6a91e06.log
-rw-r--r--   1 root root  222 Mar 16 09:11 buffer.b5a0ed281c85b0bff32ec4dd6f6a91e06.log.meta
-rw-r--r--   1 root root 183K Mar 16 09:11 buffer.b5a0ed2e4f76f4d08299d3adb6cedeb29.log
-rw-r--r--   1 root root   83 Mar 16 09:11 buffer.b5a0ed2e4f76f4d08299d3adb6cedeb29.log.meta
-rw-r--r--   1 root root 1.9M Mar 16 09:10 buffer.b5a0ed2d32a2502615f507638c8a42711.log
-rw-r--r--   1 root root  208 Mar 16 09:10 buffer.b5a0ed2d32a2502615f507638c8a42711.log.meta
-rw-r--r--   1 root root  17K Mar 16 08:29 buffer.b5a0ede80392ff3bddf11cf8e709a3259.log
-rw-r--r--   1 root root  205 Mar 16 08:29 buffer.b5a0ede80392ff3bddf11cf8e709a3259.log.meta
-rw-r--r--   1 root root 4.9K Mar 16 08:12 buffer.b5a0ed2a2be7851b8fed18bb5745d3821.log
-rw-r--r--   1 root root   82 Mar 16 08:12 buffer.b5a0ed2a2be7851b8fed18bb5745d3821.log.meta
-rw-r--r--   1 root root  128 Mar 16 08:12 buffer.b5a0f4612f014c24904ea00e1314342c3.log
-rw-r--r--   1 root root   81 Mar 16 08:12 buffer.b5a0f4612f014c24904ea00e1314342c3.log.meta
drwxr-xr-x   2 root root 4.0K Mar 16 08:12 .
drwxr-xr-x. 26 root root 4.0K Mar 16 08:04 ..
-rw-r--r--   1 root root 3.6M Mar 16 07:18 buffer.q5a0f064c4c6191fe52377e86b93ac5c1.log
-rw-r--r--   1 root root  185 Mar 16 07:18 buffer.q5a0f064c4c6191fe52377e86b93ac5c1.log.meta
-rw-r--r--   1 root root 3.6M Mar 16 03:27 buffer.q5a0ed2a2be44d342cc4ace0849b87681.log
-rw-r--r--   1 root root  185 Mar 16 03:27 buffer.q5a0ed2a2be44d342cc4ace0849b87681.log.meta
-rw-r--r--   1 root root   81 Mar 15 23:36 buffer.q5a0ed281c815ce1d996b701324ef2f34.log.meta
-rw-r--r--   1 root root  702 Mar 15 23:36 buffer.q5a0ed281c815ce1d996b701324ef2f34.log
-rw-r--r--   1 root root  183 Mar 15 23:35 buffer.q5a0ed263ccf18519cdc1081e7bdf2d8b.log.meta
-rw-r--r--   1 root root  188 Mar 15 23:35 buffer.q5a0ed260efbd77103296f04691540f76.log.meta
-rw-r--r--   1 root root  19K Mar 15 23:35 buffer.q5a0ed263ccf18519cdc1081e7bdf2d8b.log
-rw-r--r--   1 root root 9.0K Mar 15 23:35 buffer.q5a0ed260efbd77103296f04691540f76.log
-rw-r--r--   1 root root  206 Mar 15 23:27 buffer.q5a0ed096f5d6c33c5cb94a686e5a54c4.log.meta
-rw-r--r--   1 root root 6.8K Mar 15 23:26 buffer.q5a0ed096f5d6c33c5cb94a686e5a54c4.log

There're connections to destination host 10.32.150.166:12000, but no traffic:

root@fluentd-5m67p:/home/fluent# netstat -ntp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.42.28.8:60316        10.32.150.166:12000     ESTABLISHED 19/ruby
tcp        0      0 10.42.28.8:8080         10.42.12.198:45760      TIME_WAIT   -
tcp        0      0 10.42.28.8:33176        10.32.150.166:12000     ESTABLISHED 19/ruby

Steps to replicate

root@fluentd-5m67p:/fluentd# bundle install --binstubs
[DEPRECATED] The --binstubs option will be removed in favor of `bundle binstubs`
Using rake 13.0.1
Using public_suffix 4.0.3
Using addressable 2.7.0
Using bundler 2.1.2
Using concurrent-ruby 1.1.6
Using cool.io 1.6.0
Using unf_ext 0.0.7.6
Using unf 0.1.4
Using domain_name 0.5.20190701
Using multi_json 1.14.1
Using elasticsearch-api 6.8.1
Using multipart-post 2.1.1
Using faraday 0.17.3
Using elasticsearch-transport 6.8.1
Using elasticsearch 6.8.1
Using excon 0.72.0
Using ffi 1.12.2
Using ffi-compiler 1.0.1
Using http_parser.rb 0.6.0
Using msgpack 1.3.3
Using sigdump 0.2.4
Using serverengine 2.2.1
Using strptime 0.2.3
Using tzinfo 2.0.1
Using tzinfo-data 1.2019.3
Using yajl-ruby 1.4.1
Using fluentd 1.9.2
Using fluent-config-regexp-type 1.0.0
Using fluent-plugin-concat 2.4.0
Using fluent-plugin-detect-exceptions 0.0.13
Using fluent-plugin-elasticsearch 4.0.4
Using fluent-plugin-grok-parser 2.6.1
Using fluent-plugin-json-in-json-2 1.0.2
Using http-cookie 1.0.3
Using http-form_data 2.2.0
Using http-parser 1.2.1
Using http 4.3.0
Using recursive-open-struct 1.1.0
Using http-accept 1.7.0
Using mime-types-data 3.2019.1009
Using mime-types 3.3.1
Using netrc 0.11.0
Using rest-client 2.1.0
Using kubeclient 4.6.0
Using lru_redux 1.1.0
Using fluent-plugin-kubernetes_metadata_filter 2.3.0
Using fluent-plugin-multi-format-parser 1.0.0
Using quantile 0.2.1
Using prometheus-client 0.9.0
Using fluent-plugin-prometheus 1.6.1
Using fluent-plugin-record-modifier 2.0.1
Using fluent-plugin-rewrite-tag-filter 2.2.0
Using systemd-journal 1.3.3
Using fluent-plugin-systemd 1.0.2
Using oj 3.8.1
Bundle complete! 15 Gemfile dependencies, 55 gems now installed.
Bundled gems are installed into `./vendor/bundle`

fluentd.conf

<source>
  @type prometheus
  bind 0.0.0.0
  port "#{ENV['FLUENTD_PROMETHEUS_PORT']}"
</source>

<source>
  @type prometheus_monitor
  <labels>
    cluster "#{ENV['CLUSTER_NAME']}"
    node "#{ENV['NODE_NAME']}"
    deployment es
  </labels>
</source>

<source>
  @type prometheus_output_monitor
  <labels>
    cluster "#{ENV['CLUSTER_NAME']}"
    node "#{ENV['NODE_NAME']}"
    deployment es
  </labels>
</source>

<source>
  @type prometheus_tail_monitor
  <labels>
    cluster "#{ENV['CLUSTER_NAME']}"
    node "#{ENV['NODE_NAME']}"
    deployment es
  </labels>
</source>


# logs of containers maintained by k8s
<source>
  @type tail
  read_from_head true
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-k8s.pos
  time_format %Y-%m-%dT%H:%M:%S.%L
  tag k8s.*
  format json
</source>


# logs of containers managed by RKE
<source>
  @type tail
  read_from_head true
  path /var/lib/rancher/rke/log/*.log
  pos_file /var/log/fluentd-rancher-rke.pos
  time_format %Y-%m-%dT%H:%M:%S.%L
  tag rancher-rke.*
  format json
</source>

<filter k8s.**>
  @type kubernetes_metadata
  kubernetes_url https://kubernetes.default.svc:443
  ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
</filter>

<filter **>
  @type record_transformer
  <record>
    projectName es
    projectVersion "1.0.0"
    logType ${tag_parts[0]}
    file ${tag}
  </record>
</filter>

<match **>
  @type elasticsearch
  host es.example.com
  port 12000
  user user
  password ***
  scheme https
        ssl_verify false
  ssl_version TLSv1_2
        logstash_format true
  reload_connections false
  reconnect_on_error true
  reload_on_failure false
  <buffer>
    @type file
    path /var/log/fluentd-es.buffer
    flush_at_shutdown true
    flush_mode interval
    flush_interval 60s
    flush_thread_count 2
    chunk_limit_size 10MB
    chunk_limit_records 10000
  </buffer>
  flatten_hashes true
  flatten_hashes_separator _
  @log_level debug

Expected Behavior or What you need to ask

...

Using Fluentd and ES plugin versions

  • OS version: CentOS Linux release 7.7.1908 (Core)
  • Kubernetes: 1.10.3-rancher1.1+218721003ebdda
  • Fluentd:
    • fluent/fluentd-kubernetes-daemonset:v1.9.2-debian-elasticsearch6-1.0
    • fluentd 1.9.2
  • ES plugin 4.0.3
    • paste boot log of fluentd or td-agent
w022571811645m:~ JP25060$ kubectl -n efk logs   fluentd-5m67p   | head -n 500
2020-03-12 10:10:23 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluentd.conf"
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-concat' version '2.4.0'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '4.0.4'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-grok-parser' version '2.6.1'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-json-in-json-2' version '1.0.2'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.3.0'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.6.1'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.0.1'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.2.0'
2020-03-12 10:10:23 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2020-03-12 10:10:23 +0000 [info]: gem 'fluentd' version '1.9.2'
2020-03-12 10:10:23 +0000 [debug]: 'host es.example.com' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'host: es.example.com' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: 'index_name fluentd' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'index_name: fluentd' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: 'template_name ' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'template_name: ' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: 'logstash_prefix logstash' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'logstash_prefix: logstash' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: 'deflector_alias ' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'deflector_alias: ' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: 'application_name default' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'application_name: default' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [debug]: Need substitution: false
2020-03-12 10:10:23 +0000 [debug]: 'host_placeholder es.example.com' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'host_placeholder: es.example.com' doesn't have tag placeholder
2020-03-12 10:10:23 +0000 [warn]: define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2020-03-12 10:10:23 +0000 [info]: using configuration file: <ROOT>
  <source>
    @type prometheus
    bind "0.0.0.0"
    port 8080
  </source>
  <source>
    @type prometheus_monitor
    <labels>
      cluster local
      node worker007
      deployment es
    </labels>
  </source>
  <source>
    @type prometheus_output_monitor
    <labels>
      cluster local
      node worker007
      deployment es
    </labels>
  </source>
  <source>
    @type prometheus_tail_monitor
    <labels>
      cluster local
      node worker007
      deployment es
    </labels>
  </source>
  <source>
    @type tail
    read_from_head true
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-k8s.pos"
    time_format %Y-%m-%dT%H:%M:%S.%L
    tag "k8s.*"
    format json
    <parse>
      time_format %Y-%m-%dT%H:%M:%S.%L
      @type json
      time_type string
    </parse>
  </source>
  <source>
    @type tail
    read_from_head true
    path "/var/lib/rancher/rke/log/*.log"
    pos_file "/var/log/fluentd-rancher-rke.pos"
    time_format %Y-%m-%dT%H:%M:%S.%L
    tag "rancher-rke.*"
    format json
    <parse>
      time_format %Y-%m-%dT%H:%M:%S.%L
      @type json
      time_type string
    </parse>
  </source>
  <filter k8s.**>
    @type kubernetes_metadata
    kubernetes_url "https://kubernetes.default.svc:443"
    ca_file "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    bearer_token_file "/var/run/secrets/kubernetes.io/serviceaccount/token"
  </filter>
  <filter **>
    @type record_transformer
    <record>
      projectName es
      projectVersion 1.0.0
      logType ${tag_parts[0]}
      file ${tag}
    </record>
  </filter>
  <match **>
    @type elasticsearch
    host "es.example.com"
    port 12000
    user "user"
    password xxxxxx
    scheme https
    ssl_verify false
    ssl_version TLSv1_2
    logstash_format true
    reload_connections false
    reconnect_on_error true
    reload_on_failure false
    flatten_hashes true
    flatten_hashes_separator "_"
    @log_level "debug"
    <buffer>
      @type "file"
      path "/var/log/fluentd-es.buffer"
      flush_at_shutdown true
      flush_mode interval
      flush_interval 60s
      flush_thread_count 2
      chunk_limit_size 10MB
      chunk_limit_records 10000
    </buffer>
  </match>
</ROOT>
2020-03-12 10:10:23 +0000 [info]: starting fluentd-1.9.2 pid=7 ruby="2.6.5"
2020-03-12 10:10:23 +0000 [info]: spawn command to main:  cmdline=["/usr/local/bin/ruby", "-Eascii-8bit:ascii-8bit", "/fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-p", "/fluentd/plugins", "--gemfile", "/fluentd/Gemfile", "-r", "/fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-4.0.4/lib/fluent/plugin/elasticsearch_simple_sniffer.rb", "--under-supervisor"]
2020-03-12 10:10:24 +0000 [info]: adding filter pattern="k8s.**" type="kubernetes_metadata"
2020-03-12 10:10:25 +0000 [info]: adding filter pattern="**" type="record_transformer"
2020-03-12 10:10:25 +0000 [info]: adding match pattern="**" type="elasticsearch"
2020-03-12 10:10:25 +0000 [debug]: #0 'host es.example.com' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'host: es.example.com' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 'index_name fluentd' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'index_name: fluentd' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 'template_name ' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'template_name: ' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 'logstash_prefix logstash' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'logstash_prefix: logstash' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 'deflector_alias ' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'deflector_alias: ' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 'application_name default' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'application_name: default' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [debug]: #0 Need substitution: false
2020-03-12 10:10:25 +0000 [debug]: #0 'host_placeholder es.example.com' is tested built-in placeholder(s) but there is no valid placeholder(s). error: Parameter 'host_placeholder: es.example.com' doesn't have tag placeholder
2020-03-12 10:10:25 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
2020-03-12 10:10:25 +0000 [info]: adding source type="prometheus"
2020-03-12 10:10:25 +0000 [info]: adding source type="prometheus_monitor"
2020-03-12 10:10:25 +0000 [info]: adding source type="prometheus_output_monitor"
2020-03-12 10:10:25 +0000 [info]: adding source type="prometheus_tail_monitor"
2020-03-12 10:10:25 +0000 [info]: adding source type="tail"
2020-03-12 10:10:25 +0000 [info]: adding source type="tail"
2020-03-12 10:10:25 +0000 [warn]: #0 define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2020-03-12 10:10:25 +0000 [info]: #0 starting fluentd worker pid=19 ppid=7 worker=0
2020-03-12 10:10:25 +0000 [debug]: #0 buffer started instance=69827688780720 stage_size=0 queue_size=0
2020-03-12 10:10:25 +0000 [debug]: #0 flush_thread actually running
2020-03-12 10:10:25 +0000 [debug]: #0 flush_thread actually running
2020-03-12 10:10:25 +0000 [debug]: #0 enqueue_thread actually running
2020-03-12 10:10:25 +0000 [info]: #0 following tail of /var/lib/rancher/rke/log/nginx-proxy_a6783e971fdd9e463a46e23309659b0accffd579eef65e0b1e2c20dd6cdf4c43.log
2020-03-12 10:10:25 +0000 [info]: #0 following tail of /var/lib/rancher/rke/log/kubelet_cc46029a026f355d90a144eaaf5d06515cf57fdb248cec32a622b7fc5da11e31.log
2020-03-12 10:10:25 +0000 [debug]: #0 Created new chunk chunk_id="5a0a58fc7a3f0721e875ea7672d05741" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="rancher-rke.var.lib.rancher.rke.log.kubelet_cc46029a026f355d90a144eaaf5d06515cf57fdb248cec32a622b7fc5da11e31.log", variables=nil>
  • paste result of fluent-gem list, td-agent-gem list or your Gemfile.lock
  • ES version: 6.3.2

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions