
Memory Spike and lots of k8s_tagger processor errors #1267

@vignesh-codes

Description

Hey team,

I saw a sudden rise in memory in my pod, and GC is not reducing the memory consumption.
Image: sumologic-otel-collector:0.85.0-sumo-0

% kubectl version        
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.16-eks-2d98532

The pod memory went beyond 90% of the defined limit and did not come back to normal. After I manually restarted the pod, memory usage returned to normal.

Upon checking the logs, we found that memory usage went above the hard limit but did not come down after GC.
We also found a number of error logs triggered by the k8s_tagger processor:

2023-09-27T03:58:32.023Z	error	kube/client.go:199	object received was not of type api_v1.Pod	 {
    "kind": "processor",
    "name": "k8s_tagger",
    "pipeline": "logs/otlp/containers",
    "received": {
        "Key": "adventtransplant/xx",
        "Obj": {
            "metadata": {
                "name": "xx",
                "namespace": "adventtransplant",
                "uid": "1c8dd1b2-79a3-46f2-86da-xx",
                "creationTimestamp": null,
                "labels": {
                    "mycompany.io/bu": "dc",
                    "mycompany.io/consumer": "xx",
                    "mycompany.io/created-by": "fn-xx-operator",
                    "mycompany.io/expiry": "9999-01-01",
                    "mycompany.io/fedramp": "true",
                    "mycompany.io/hipaa": "true",
                    "mycompany.io/irap": "true",
                    "mycompany.io/owner": "xx",
                    "mycompany.io/pci": "true",
                    "mycompany.io/stage": "prod",
                    "app": "xx",
                    "fnresources.mycompany.io/application-id": "4c68910e-f1f0-482f-a352-d6ca578da628",
                    "fnresources.mycompany.io/environment-id": "d9d171ca-c63b-4f1d-b87b-e94b6f826812",
                    "fnresources.mycompany.io/inject-mcrouter": "memcached",
                    "fnresources.mycompany.io/inject-mcrouter-exporter": "true",
                    "fnresources.mycompany.io/subscription-id": "89aff85b-cf48-4109-9d76-62036c332bb7",
                    "rollouts-pod-template-hash": "7747cc9cf8"
                },
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
                    "fnresources.mycompany.io/php-apache-config-hash": "421794d60a294800473424de4e20e8xxx",
                    "kubernetes.io/psp": "eks.privileged",
                    "vpc.amazonaws.com/pod-ips": "99.11.11.111"
                },
                "ownerReferences": [
                    {
                        "apiVersion": "apps/v1",
                        "kind": "ReplicaSet",
                        "name": "xx-7747cc9cf8",
                        "uid": "b3cf6222-911a-48a8-98ca-01b4378dd003",
                        "controller": true,
                        "blockOwnerDeletion": true
                    }
                ]
            },
            "spec": {
                "containers": null,
                "nodeName": "ip-xx-26-152-xxx.ec2.internal"
            },
            "status": {
                "podIP": "99.11.11.111",
                "startTime": "2023-09-27T02:35:32Z"
            }
        }
    }
}

We used the default values:

k8s_tagger:
        extract:
          annotations:
          - key: '*'
            tag_name: pod_annotations_%s
          delimiter: _
          labels:
          - key: '*'
            tag_name: pod_labels_%s
          metadata:
          - daemonSetName
          - deploymentName
          - hostName
          - namespace
          - nodeName
          - podName
          - serviceName
          - statefulSetName
          namespace_labels:
          - key: '*'
            tag_name: namespace_labels_%s
        owner_lookup_enabled: true
        passthrough: false
        pod_association:
        - from: build_hostname
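
In case the wildcard label/annotation extraction contributes to the metadata held per pod, this is a narrowed variant we are considering (a sketch only; the specific label keys and tag names below are examples from our environment, not something we have validated):

k8s_tagger:
  extract:
    # Keep only the labels we actually use downstream instead of '*',
    # so less metadata is cached per pod.
    labels:
    - key: mycompany.io/owner
      tag_name: pod_labels_owner
    - key: mycompany.io/stage
      tag_name: pod_labels_stage
    delimiter: _
    metadata:
    - deploymentName
    - namespace
    - nodeName
    - podName
  owner_lookup_enabled: true
  passthrough: false
  pod_association:
  - from: build_hostname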

memory_limiter:
        check_interval: 5s
        limit_percentage: 75
        spike_limit_percentage: 20
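
For reference, this is roughly the explicit-limit variant we are experimenting with (a sketch only; the MiB values are assumptions tied to our container memory limit, not values taken from this deployment):

memory_limiter:
  # Explicit MiB limits instead of percentages, and a shorter check interval,
  # so the hard/soft limits do not depend on how the container limit is detected.
  check_interval: 2s
  limit_mib: 4000        # assumed hard limit, kept below the container memory limit
  spike_limit_mib: 800   # assumed spike headroom; soft limit = limit_mib - spike_limit_mib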

I also see, in the previous logs of restarted pods across multiple clusters, that memory went above the hard limit but did not come down after GC, along with lots of errors of that type:

2023-09-29T10:58:46.777Z	warn	[email protected]/memorylimiter.go:276	Memory usage is above hard limit. Forcing a GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6324}
2023-09-29T10:58:58.857Z	info	[email protected]/memorylimiter.go:266	Memory usage after GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6315}
2023-09-29T10:58:58.857Z	warn	[email protected]/memorylimiter.go:276	Memory usage is above hard limit. Forcing a GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6315}
2023-09-29T10:59:16.271Z	info	[email protected]/memorylimiter.go:266	Memory usage after GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6458}
2023-09-29T10:59:16.271Z	warn	[email protected]/memorylimiter.go:276	Memory usage is above hard limit. Forcing a GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6458}
2023-09-29T10:59:29.058Z	info	[email protected]/memorylimiter.go:266	Memory usage after GC.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs/otlp/auditlogs", "cur_mem_mib": 6339}

We see lots of these logs, and memory usage often remains elevated.

We also see an increased error count in the log collector pod, which forwards data to this memory-bound OTel pod:

"2023-09-27T05:50:30.628Z	error	exporterhelper/queued_retry.go:319	Exporting failed. The error is not retryable. Dropping data.	{\"kind\": \"exporter\", \"data_type\": \"logs\", \"name\": \"otlphttp\", \"error\": \"Permanent error: error exporting items, request to http://sumologic-logging-metadata-logs.xx.svc.cluster.local.:0011/v1/logs responded with HTTP Status Code 500, Message=data refused due to high memory usage; data refused due to high memory usage; data refused due to high memory usage, Details=[]\", \"dropped_items\": 546}go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send	

I want to know the following:

  • Why am I getting this type error?
  • Did the memory spike happen because of that error, causing GC to not work as expected?
  • Why doesn't the OTel collector resend the data to other OTel pods?

Expected behavior:

  • The type error should not happen
  • Memory usage after GC should be lower than before GC
  • No logs should be dropped on memory errors; the data should instead be directed to another pod
