@petruanica petruanica commented Feb 26, 2025

Description of the issue

  • Add support for monitoring EC2 UltraServers using CloudWatch Agent

Description of changes

  • Allowlisted UltraServer dimension for Neuron metrics
  • Added new (ClusterName, UltraServer) dimension for Neuron metrics emitted at the node level

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  • Updated test files to include the new UltraServer dimension
  • Verified that the metric output format includes the new UltraServer identifier by deploying the changes to an EKS test cluster

EMF sample:

{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "UltraServer"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "InstanceType",
                    "NeuronCore",
                    "NeuronDevice",
                    "NodeName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_neuroncore_memory_usage_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_runtime_memory",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_constants",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_model_code",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_model_shared_scratchpad",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_tensors",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "HyperPodNeuronEFA",
    "InstanceId": "i-07c72013a77a6826e",
    "InstanceType": "trn1.2xlarge",
    "NeuronCore": "core1",
    "NeuronDevice": "device0",
    "NodeName": "ip-192-168-94-194.us-west-2.compute.internal",
    "Timestamp": "1740591416636",
    "Type": "NodeAWSNeuronCore",
    "UltraServer": "u-1234567890",
    "Version": "0",
    "availability_zone": "us-west-2b",
    "kubernetes": {
        "host": "ip-192-168-94-194.us-west-2.compute.internal"
    },
    "region": "us-west-2",
    "subnet_id": "subnet-0dfa65d0c9792b6f3",
    "node_neuroncore_memory_usage_constants": 0,
    "node_neuroncore_memory_usage_model_code": 0,
    "node_neuroncore_memory_usage_model_shared_scratchpad": 0,
    "node_neuroncore_memory_usage_runtime_memory": 0,
    "node_neuroncore_memory_usage_tensors": 0,
    "node_neuroncore_memory_usage_total": 0,
    "node_neuroncore_utilization": 0
}
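
For reviewers less familiar with EMF: each entry under "Dimensions" is a list of names that refer to top-level members of the same document, and each list yields one dimension combination for the listed metrics. The sketch below is a minimal illustration in Python, not code from this PR. It uses a trimmed copy of the sample above, and the helper name `resolve_dimension_sets` is hypothetical. It shows that the new (ClusterName, UltraServer) set resolves to concrete values.

```python
# Trimmed copy of the EMF sample above (one metric kept for brevity).
sample = {
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                ["ClusterName"],
                ["ClusterName", "UltraServer"],
                ["ClusterName", "InstanceId", "NodeName"],
                ["ClusterName", "InstanceId", "InstanceType",
                 "NeuronCore", "NeuronDevice", "NodeName"],
            ],
            "Metrics": [
                {"Name": "node_neuroncore_utilization", "Unit": "Percent"},
            ],
        }
    ],
    "ClusterName": "HyperPodNeuronEFA",
    "InstanceId": "i-07c72013a77a6826e",
    "InstanceType": "trn1.2xlarge",
    "NeuronCore": "core1",
    "NeuronDevice": "device0",
    "NodeName": "ip-192-168-94-194.us-west-2.compute.internal",
    "UltraServer": "u-1234567890",
    "node_neuroncore_utilization": 0,
}


def resolve_dimension_sets(emf_doc):
    """Return one {name: value} dict per dimension set in the document.

    Hypothetical helper for illustration: every dimension name is looked up
    as a top-level member of the EMF document.
    """
    resolved = []
    for directive in emf_doc.get("CloudWatchMetrics", []):
        for dim_set in directive.get("Dimensions", []):
            resolved.append({name: emf_doc[name] for name in dim_set})
    return resolved


dimension_sets = resolve_dimension_sets(sample)
for dims in dimension_sets:
    print(dims)
```

The second printed entry pairs ClusterName with UltraServer, which is the combination this change adds alongside the existing cluster-level, node-level, and core-level sets.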

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@petruanica petruanica marked this pull request as ready for review March 3, 2025 16:24
@petruanica petruanica requested a review from a team as a code owner March 3, 2025 16:24
movence previously approved these changes Mar 10, 2025
sky333999 previously approved these changes Mar 11, 2025
@github-actions

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Mar 27, 2025
@petruanica petruanica dismissed stale reviews from movence and sky333999 via 3326e3f July 3, 2025 10:27
@github-actions github-actions bot removed the Stale label Jul 6, 2025
@github-actions

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Jul 13, 2025
@sky333999 sky333999 merged commit 4c3550e into aws:main Aug 14, 2025
17 checks passed