Remove readiness probe otel-agent sidecars #1791
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: *(none)*. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
Summary of Changes
Hello @tiffanny29631, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the `otel-agent` deployment by implementing Kubernetes `startupProbe` definitions. This change provides a dedicated grace period for the agent to fully initialize, mitigating issues where slow startup times could lead to unnecessary pod restarts, and improving the overall stability of the service.
Highlights
- **Kubernetes Probe Configuration:** Introduced `startupProbe` configurations for the `otel-agent` across multiple Kubernetes manifest files. The new probe targets port `13133` with an `httpGet` check, configured to allow up to 5 minutes for the agent to initialize before the standard readiness and liveness checks begin.
- **Reliability Improvement:** The addition of the `startupProbe` aims to prevent premature restarts and `CrashLoopBackOff` errors for the `otel-agent`, especially in environments with heavy load or resource constraints where the agent might take longer to bind to its health-check port.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request adds a `startupProbe` to the `otel-agent` container for improved startup reliability. To improve maintainability, I've suggested centralizing the `startupProbe` configuration to avoid duplication across multiple manifest files.
```yaml
startupProbe:
  httpGet:
    path: /
    port: 13133
  failureThreshold: 30 # Allow up to 5 minutes (30 * 10s)
  periodSeconds: 10
  initialDelaySeconds: 0
```
Consider defining the `startupProbe` configuration in a central location (e.g., a common YAML file or a Go struct) and referencing it across these manifests to avoid duplication and improve maintainability. If that is not feasible, consider adding a comment referencing this file as the source of truth to mitigate inconsistent updates.
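As a minimal sketch of that fallback option (the comment text and the file chosen as the source of truth are illustrative assumptions, not taken from this PR), each duplicated copy could carry a pointer like this:

```yaml
# NOTE: Keep in sync with manifests/templates/reconciler-manager-configmap.yaml,
# which serves as the source of truth for the otel-agent startupProbe settings.
startupProbe:
  httpGet:
    path: /
    port: 13133
  failureThreshold: 30 # allow up to 5 minutes (30 * 10s)
  periodSeconds: 10
  initialDelaySeconds: 0
```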
```yaml
startupProbe:
  httpGet:
    path: /
    port: 13133
  failureThreshold: 30 # Allow up to 5 minutes (30 * 10s)
  periodSeconds: 10
  initialDelaySeconds: 0
```
Consider defining the `startupProbe` configuration in a central location (e.g., a common YAML file or a Go struct) and referencing it across these manifests to avoid duplication and improve maintainability. If that is not feasible, consider adding a comment referencing this file as the source of truth to mitigate inconsistent updates.
```yaml
startupProbe:
  httpGet:
    path: /
    port: 13133
  failureThreshold: 30 # Allow up to 5 minutes (30 * 10s)
  periodSeconds: 10
  initialDelaySeconds: 0
```
Consider defining the `startupProbe` configuration in a central location (e.g., a common YAML file or a Go struct) and referencing it across these manifests to avoid duplication and improve maintainability. If that is not feasible, consider adding a comment referencing this file as the source of truth to mitigate inconsistent updates.
```diff
@@ -195,6 +195,13 @@ data:
         volumeMounts:
         - name: otel-agent-config-reconciler-vol
           mountPath: /conf
+        startupProbe:
```
Could you clarify the intent of this change? The commit message mentions a liveness probe; however, no liveness probe has been defined for these containers (since 2023).

Also, I expect the `readinessProbe` on these containers may be superfluous, as none of these Deployments are exposed through a Service (except for the `otel-collector` Deployment).
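For background on that point (general Kubernetes behavior, not specific to this repository): a failing `readinessProbe` only removes a Pod from the endpoints of Services that select it, so a Deployment with no Service gains little from one. A minimal sketch, with assumed labels and ports:

```yaml
# Hypothetical Service: only here does readiness matter, because Pods
# failing their readinessProbe are dropped from this Service's endpoints.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector # the one Deployment here that is exposed
  namespace: config-management-monitoring
spec:
  selector:
    app: opentelemetry # assumed label, for illustration only
  ports:
  - port: 55678        # assumed OpenCensus receiver port
    targetPort: 55678
```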
Updated in http://b/419026380#comment3. The readiness probe for the otel-agent sidecar can also be removed to align with the configuration of the other containers. A readiness failure in the telemetry container is not meaningful, and could block the reconciler and resource-group controller pods.
> A readiness failure in the telemetry container is not meaningful, and could block the reconciler and resource-group controller pods.

Could you explain why the readiness probe blocks the reconciler and resource-group-controller Pods?
The PR has been updated to remove the readiness probe from the otel-agent sidecar in the reconciler, reconciler-manager, and resource-group controller.
> The PR has been updated to remove the readiness probe from the otel-agent sidecar in the reconciler, reconciler-manager, and resource-group controller.

The motivation and intent for this change are still unclear. Can you explain the "why" of this change?
Ack. Can you include the justification in the commit message and PR description?

> causing operational issues

Can you clarify what the operational issues are?

> "pod not ready" alarms

Can you clarify which alarms this is referring to?

The change seems reasonable, but I'm a bit surprised by the amount of churn. I'd perhaps consider timeboxing this effort, given that it doesn't solve the underlying issue of the `otel-agent` being unresponsive.
Updated; also limited the change to only removing the `readinessProbe`, for consistent test results.
> Updated; also limited the change to only removing the `readinessProbe`, for consistent test results.

Could you elaborate a bit more on this? What was the cause of the prior test failures, and why are the tests passing now?

If the `readinessProbe` is removed, I would expect the health check endpoint/port to be removed as well. Why is the health check endpoint retained?
The test failure was inconsistent and flaky between runs; you can see the test history in the PR email thread.

Previously, even after completely removing the readiness configuration, some tests still showed the resource-group-controller and reconciler-manager pods with readiness settings. This caused failures because the port and `health_check` extension had already been removed.

When I ran the same tests on my local kind cluster, the issue didn't happen, so I can't identify the root cause yet. I suggest we do this in two steps: first remove the readiness probe, then remove the port and extension infrastructure.
Wait, I might have reproduced it; allow some more time to figure out the root cause.
/retest-required
Pull Request Overview
This PR removes readiness probes and health check configurations from otel-agent sidecar containers across multiple controllers to prevent unnecessary pod unready states caused by health check port binding failures or CPU throttling issues.
- Removes health check extension configuration from OpenTelemetry collector configs
- Removes readiness probes and port 13133 from container specifications
- Updates test expectations to reflect the removal of health check endpoints
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/kustomization/expected.yaml | Updates test expectations to remove health check configurations and readiness probes |
| pkg/reconcilermanager/controllers/reconciler_base_test.go | Removes liveness/readiness probe configurations from test cases and updates expected patch strings |
| manifests/templates/resourcegroup-manifest.yaml | Removes health check extension and readiness probe from resourcegroup controller otel-agent |
| manifests/templates/reconciler-manager.yaml | Removes health check port and readiness probe from reconciler-manager otel-agent |
| manifests/templates/reconciler-manager-configmap.yaml | Removes health check port and readiness probe from reconciler template |
| manifests/otel-agent-reconciler-cm.yaml | Removes health check extension configuration from reconciler otel-agent config |
| manifests/otel-agent-cm.yaml | Removes health check extension configuration from base otel-agent config |
/retest-required
```diff
@@ -91,10 +91,6 @@ spec:
         volumeMounts:
         - name: otel-agent-config-vol
           mountPath: /conf
-        readinessProbe:
```
Odd. I removed the `readinessProbe` from both the reconciler-manager and resource-group-controller configurations, but when checking the pod description for the failed tests, the readiness probe is still showing up. Also, there are two resource-group-controller pods running, with one not ready due to readiness failure, which seems unusual.
The same issue does not apply to the test root-reconciler; I suspect a caching issue.
The readiness probe on the otel-agent container was causing operational issues:
- False alarms during slow cluster startup, due to health check binding failures
- Inconsistency with other containers (git-sync, reconciler), which don't use readiness probes
- Redundancy for a telemetry sidecar that doesn't provide direct user-facing services
@tiffanny29631: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Remove the readiness probe and health check from the otel-agent containers in the reconciler, reconciler-manager, and resourcegroup controller to align container behavior. The health-check component can fail to bind to its port or to respond under CPU throttling, causing unnecessary pod-unready states even when the container is running. The otel-collector health check in the config-management-monitoring namespace is retained, since it is tied to a Service.
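As a hedged illustration of what was removed (abridged and reconstructed, not the exact file contents), the otel-agent collector configs previously registered a `health_check` extension on port 13133, which the `readinessProbe` targeted:

```yaml
# Reconstructed sketch of the removed otel-agent config fragment:
extensions:
  health_check:                # removed: served the endpoint on port 13133
service:
  extensions: [health_check]   # removed: unregisters the extension
```

With the extension and the container's port 13133 both gone, a readiness probe pointing at that endpoint would always fail, which is why the probe, the port, and the extension are removed together.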