
Conversation

jagan2221
Contributor

@jagan2221 jagan2221 commented Jun 20, 2025

  1. Collector pods were crashing in IPv6 environments due to an invalid IP address format being used when exposing endpoints such as the health check, liveness, and OTLP HTTP/gRPC server endpoints.
    These endpoints are constructed from the pod IP environment variable, which needs to be enclosed in square brackets for IPv6. Added a generic flag to configure this for all endpoints.

  2. The regex used to extract the pod IP and port in the pod-annotations scrape config does not support IPv6 parsing.
    Introduced a method that constructs the scrape address from the pod IP and the Prometheus port.

Both of the above changes are behind a feature flag: sumologic.ipv6mode (see the sketch below).
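
For illustration only, a rough sketch of what the flag is meant to change. The exact values keys, template helpers, environment variable name, and ports used by the chart may differ; MY_POD_IP and the relabel rule below are assumptions, not the chart's actual implementation.

# values.yaml: opt in to the IPv6-safe behavior
sumologic:
  ipv6mode: true

# 1) Endpoint construction: the pod IP is wrapped in square brackets so an IPv6
#    address parses as host:port, e.g. [fd00::1]:4317 instead of fd00::1:4317.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "[${env:MY_POD_IP}]:4317"
      http:
        endpoint: "[${env:MY_POD_IP}]:4318"
extensions:
  health_check:
    endpoint: "[${env:MY_POD_IP}]:13133"

# 2) Pod-annotations scrape config: build __address__ from the pod IP label and the
#    prometheus.io/port annotation instead of regex-splitting the address on ":",
#    which breaks for IPv6.
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: (.+);(\d+)
    replacement: "[$1]:$2"
    target_label: __address__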

Detailed summary of the changes: https://docs.google.com/document/d/1aBfne-cN6k9p_Lw3zZ2ZqY8Ho0AHYTzQB-nIpYx--GA
These are the issues a customer faced when setting up the Helm chart in an IPv6 cluster. The fixes are currently deployed in the customer's setup via manual overrides using config merge.
Jira: https://sumologic.atlassian.net/browse/OSC-1043

Checklist

  • Changelog updated or skip changelog label added
  • Documentation updated
  • Template tests added for new features
  • Integration tests added or modified for major features

ipv6 compatibility fixes and UT's for the same
@jagan2221 jagan2221 requested a review from a team as a code owner June 20, 2025 14:30
@rnishtala-sumo
Contributor

We need a changelog for this. The approach makes sense to me. Let's ensure that the integration tests pass. It might also make sense to write an integration test for metrics collection.

@jagan2221
Contributor Author

jagan2221 commented Jun 23, 2025

https://github.com/SumoLogic/sumologic-kubernetes-collection/actions/runs/15817837079/job/44580109359?pr=3949
Something is wrong with the Helm_Routing_OT test: it was successful in the run above for the same PR, but it is failing in other runs.

I've checked other PRs too; this test fails or passes randomly.

@jagan2221 jagan2221 requested a review from rnishtala-sumo June 23, 2025 14:49
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Requesting ITs for this change because of its large footprint.

Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Disabling the setup job to test IPv6 is not preferred; the setup job does more than manage secrets. Instead of implementing this workaround, we should consider adding manual test instructions to our Vagrant docs if using DNS64 or NAT64 is complex.

@jagan2221 jagan2221 requested a review from rnishtala-sumo July 3, 2025 16:06
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

Requesting that the integration test be removed for now, until we have a solution for the setup job. It can live on a branch. I recommend that our Vagrant (developer) docs be updated with instructions on testing this.

@jagan2221
Contributor Author

Even the Vagrant test steps will be based on disabling the setup job and then testing the Helm chart further. I hope this is fine for the local test steps, @rnishtala-sumo.

@jagan2221
Contributor Author

jagan2221 commented Aug 14, 2025

@rnishtala-sumo @echlebek We deployed these changes this week in a second customer setup (ChargeCloud), in their IPv6-only cluster, and the deployment is working fine.

For the two customers we have seen so far, the issue we were asked to solve is at the Sumo pod level, not NAT64/DNS64 network-level issues.

QE has started testing as well. We need this merged so QE can run the E2E tests.

@rnishtala-sumo
Contributor

Recommend considering documentation for at least one cluster type before merging this. A walkthrough of the manual steps needed before the Helm chart is deployed would be ideal; it could be reviewed by QE as well. An example can be seen in the doc for EKS Fargate. This could be in a different PR.

@jagan2221
Contributor Author

https://docs.google.com/document/d/1ifCHtPsrz9ntTYyigGV0RkOb_9OzVp8j_DmmfqHJ2yM
@rnishtala-sumo I'm creating a draft doc for this - can you review it? I will get it reviewed by QE as well.

@rnishtala-sumo
Contributor

rnishtala-sumo commented Aug 20, 2025

@jagan2221 what I meant was a public-facing doc on GitHub like this one - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/fargate.md (we could call it ipv6.md) - that walks the customer through the following steps as prerequisites before installing the Helm chart. It could say something like the following, and QE could use the same method to test this for consistency.

In AWS, the way to provide IPv6→IPv4 egress is to set up a NAT64 + DNS64 environment. The recommended approach is AWS NAT64 (via a NAT gateway plus a VPC route for the NAT64 prefix) with the DNS64 resolver. This allows your IPv6-only pods to reach IPv4 destinations transparently.

To set up a NAT Gateway:
aws ec2 create-nat-gateway \
  --subnet-id <public-subnet-id> \
  --allocation-id <eip-allocation-id> \
  --connectivity-type public

Note that, on its own, a NAT gateway only translates IPv4 → IPv4.

To enable DNS64 on the subnets that use the Amazon-provided DNS resolver, run:
aws ec2 modify-subnet-attribute \
  --subnet-id <subnet-id> \
  --enable-dns64

For the subnets where your IPv6-only pods run, add a route for the well-known NAT64 prefix (64:ff9b::/96) pointing to the NAT Gateway:
aws ec2 create-route \
  --route-table-id <rtb-id> \
  --destination-ipv6-cidr-block 64:ff9b::/96 \
  --nat-gateway-id <nat-gateway-id>

Then run a test from a pod:

apiVersion: v1
kind: Pod
metadata:
  name: nat64-test          # example name
spec:
  containers:
    - name: busybox
      image: busybox
      command: ["sh", "-c", "ping -c 4 ipv4-only-endpoint.com"]   # send a few pings, then exit

We could then add a release note saying IPv6 is supported on EKS clusters.

@jagan2221
Contributor Author

jagan2221 commented Aug 20, 2025

@rnishtala-sumo The draft doc I shared is exactly that - it would be added as something like ipv6.md.

Also, there are already comprehensive AWS EKS docs on how to configure IPv6 clusters and explaining the default IPv6→IPv4 egress in EKS clusters:

https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html#_ip_address_assignments
https://docs.aws.amazon.com/eks/latest/userguide/deploy-ipv6-cluster.html

Should we point customers to the existing, comprehensive AWS docs, or should we re-capture the AWS-side details ourselves? I think duplicating the AWS-side content on our end is likely to drift, and we wouldn't be able to keep up with their changes.
My idea is to describe just the workflow and direct customers to the detailed vendor-specific docs instead of maintaining those steps on our end. WDYT?

@rnishtala-sumo
Contributor

rnishtala-sumo commented Aug 20, 2025

In situations where we're asking a customer to go through manual steps before installing the helm chart, it helps to be specific. We're asking them to do the following

  • Create a standard AWS NAT Gateway and enable DNS64 translation at the VPC level
  • Add routing for 64:ff9b::/96 to the NAT Gateway.
  • Let EKS pods use the default VPC DNS resolver

Asking them to run specific AWS CLI commands and a test pod ensures that the prerequisites are satisfied before the Helm chart is deployed. QE can use the same steps in the E2E tests.

@jagan2221
Contributor Author

jagan2221 commented Aug 22, 2025

@rnishtala-sumo
Makes sense.
I created an EKS IPv6 cluster for testing. We don't need a NAT gateway - the VPC CNI plugin does the job, and nothing additional was required:

  1. Make sure the VPC and subnets have both IPv4 and IPv6 CIDRs, and that the subnets have the "auto-assign IPv6 address" setting enabled for nodes.
  2. Install the VPC CNI plugin, which has IPv6→IPv4 NAT capability built in.

The Helm chart installs and runs fine with the above changes. I will capture these steps and share the doc (a rough sketch follows below).
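
For reference, one way to stand up such a cluster is with eksctl, which can provision a dual-stack VPC and subnets with these settings. This is only a rough sketch under that assumption; the cluster name, region, Kubernetes version, and node group values below are illustrative, not part of this PR.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ipv6-test          # illustrative
  region: us-east-1        # illustrative
  version: "1.30"

# IPv6 addressing for pods and services
kubernetesNetworkConfig:
  ipFamily: IPv6

# IPv6 clusters need these managed addons and OIDC declared up front; the VPC CNI
# provides the built-in egress IPv6 -> IPv4 translation mentioned above.
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy

iam:
  withOIDC: true

managedNodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2

The chart would then be installed with sumologic.ipv6mode enabled, as described above.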

@jagan2221
Contributor Author

@rnishtala-sumo
#3977
Please review the doc PR.

@jagan2221
Contributor Author

@rnishtala-sumo The doc PR is throwing markdown lint errors. I'm checking that and will merge once it's resolved. Can you please provide a final approval for this PR?

@jagan2221 jagan2221 force-pushed the j_ipv6_compatibility_fixes branch from 3c3a530 to f575ef7 on September 3, 2025, 18:38
Contributor

@rnishtala-sumo rnishtala-sumo left a comment

LGTM! Let's ensure the new E2E tests run for this feature.

@jagan2221 jagan2221 merged commit 9b4db0f into main Sep 3, 2025
133 of 136 checks passed
@jagan2221 jagan2221 deleted the j_ipv6_compatibility_fixes branch September 3, 2025 20:08