Bootstrap error for ephemeral drives when using a custom-built Rocky AMI and a GPU node with instance store #7197

@erpadmin

Required Info:

  • AWS ParallelCluster version 3.14.0
  • Full cluster configuration without any credentials or personal data.
  • Cluster name: rocky-pcluster
{
  "creationTime": "2026-01-09T19:34:00.291Z",
  "headNode": {
    "launchTime": "2026-01-13T22:31:58.000Z",
    "instanceId": "i-08bed33afb8304b17",
    "publicIpAddress": "removed",
    "instanceType": "t3.medium",
    "state": "running",
    "privateIpAddress": "172.31.92.166"
  },
  "version": "3.14.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-a1212dd4272ca60a-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.14.0/clusters/rocky-pcluster-if0aqqa6lvin39j1/configs/cluster-config.yaml?versionId=Ujqxyjl37nQB4EX5pVifLbEGGm09MXHv&AWSAccessKeyId=ASIA3A2KWP626QOBJDYB&Signature=Znr6I8jxOjv%2FCYWrgR56xuoc1Jc%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEEgaCXVzLWVhc3QtMiJHMEUCIB9L%2FBfKkd6sBlymKGVztll5vhuyINpWuiszsq8cSbZ%2BAiEA11xBsMDDH11WZ69BJzFmab0wsHRVXPvoD%2FWqhSEVCugqiAMIERABGgw3NTc2ODE1MjA1NjUiDDeK6WKq7OzJvKyTVSrlAjR6kAgPtbeVUe6K4AORiGqDwLH0W%2FtuOd5XPs0n37u4lp%2FraFoRDqx72maHRjVW9chWWXWkZrLtKVA2urRishs2kGdVitGXLLc0cZIlqmJHEjwy3pkIh%2BQhTFZY1APTVFPNg4PS7tlhX6mgzBRwBAKa%2FONjeFtxKxdbKrho%2FDMOZHpkdI5ehCMxnY7DC0rtbiwmrnc2oo9AENcuATnCovw%2FZsiuD%2FDpSpzgZxXmPRFh4YnZ8RlejwQ2Z947%2FyLwLXXe4UivIsjYVk8NQVL3qVqUue%2BSclToglb07CDSJcXysF5Mw2eteL49j2QGeuAOiO92D3kvF1NYKylZeJtUt%2B4zdwIicJ4neAenOR%2FMd71aJ02NsnnG20rcA9V3tej2y6YGDdcno8TvQgbBY6Kb8SJvQ1BXzuiwgmeAv17mVziUfdhsMFB5%2F9Js8a%2Fu1B7WGPxWENMtr7lxvKSSHkIdXUa4xzDigjD0v5vLBjqkAWUBhzLT7nDnyc0k0WtTjCaOs3g63NcLWtKIosV%2By5%2BkcMpDajZ4XQXAi7QFez3HYrd935EF4C5PbP%2F4t%2FQE%2FMC1ID1wrDv%2BpF482ZNWEbBFwKVw2cDYeQ4h7OhYqQCAxpFZhXEWk09brSeCy2IcZdfqMCEcM%2FJ0ANe6WAFwBmZXY%2Bgeod5I7bzFW2AlhpmaNug9xwaXLQz3lLqKgXiOrBwM7oTQ&Expires=1768354638"
  },
  "tags": [
    {
      "value": "3.14.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "rocky-pcluster",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_COMPLETE",
  "clusterName": "rocky-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1::stack/rocky-pcluster/25aa20c0-ed92-11f0-80dc-0affcf2f8753",
  "lastUpdatedTime": "2026-01-14T00:16:35.277Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce:
Launch an instance through an srun call. The resulting instance terminates early if it is a GPU node with instance store; regular compute nodes launch fine. The instance sizes are small because the cluster is still being prototyped and tested.

CloudWatch event

{
    "datetime": "2026-01-14T00:24:23+00:00",
    "version": 0,
    "cluster-name": "rocky-pcluster",
    "scheduler": "slurm",
    "node-role": "ComputeFleet",
    "level": "ERROR",
    "instance-id": "i-0ecb993d5fbf812e5",
    "event-type": "chef-recipe-exception",
    "message": "Chef recipe exception",
    "component": "config",
    "compute": {
        "name": null,
        "instance-id": "i-0ecb993d5fbf812e5",
        "instance-type": "g5.xlarge",
        "availability-zone": "us-east-1a",
        "address": "172.31.84.112",
        "hostname": "ip-172-31-84-112.ec2.internal",
        "queue-name": "gpu",
        "compute-resource": "hpc-gpu-001",
        "node-type": null
    },
    "detail": {
        "failures": [
            {
                "exception-type": "Mixlib::ShellOut::ShellCommandFailed",
                "error-title": "Error executing action `run` on resource 'execute[Setup of ephemeral drives]'",
                "nesting-level": 0,
                "cookbook-name": "aws-parallelcluster-environment",
                "recipe-name": "ephemeral_drives",
                "source-line": "/etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-environment/recipes/config/ephemeral_drives.rb:28:in `from_file'",
                "resource-name": "Setup of ephemeral drives",
                "resource-type": "execute",
                "action": "run"
            }
        ]
    }
}
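
For triage, the event above can be reduced to a one-line summary per failure. The following is only a minimal sketch: the function name summarize_chef_failures and the trimmed EVENT_JSON string are illustrative assumptions, not part of ParallelCluster, and the field names are taken solely from the event shown above.

```python
import json

# Trimmed copy of the CloudWatch event above (only the fields used below).
EVENT_JSON = """
{
    "instance-id": "i-0ecb993d5fbf812e5",
    "event-type": "chef-recipe-exception",
    "compute": {
        "instance-type": "g5.xlarge",
        "queue-name": "gpu",
        "compute-resource": "hpc-gpu-001"
    },
    "detail": {
        "failures": [
            {
                "exception-type": "Mixlib::ShellOut::ShellCommandFailed",
                "cookbook-name": "aws-parallelcluster-environment",
                "recipe-name": "ephemeral_drives",
                "resource-name": "Setup of ephemeral drives",
                "resource-type": "execute",
                "action": "run"
            }
        ]
    }
}
"""


def summarize_chef_failures(event):
    """Return one summary line per entry in the event's detail.failures list.

    Hypothetical helper: it assumes the event has the same shape as the
    chef-recipe-exception event in this report and tolerates missing keys.
    """
    compute = event.get("compute") or {}
    detail = event.get("detail") or {}
    lines = []
    for failure in detail.get("failures", []):
        lines.append(
            f"{compute.get('instance-type')} {event.get('instance-id')} "
            f"queue={compute.get('queue-name')}: "
            f"{failure.get('cookbook-name')}::{failure.get('recipe-name')} "
            f"{failure.get('resource-type')}[{failure.get('resource-name')}] "
            f"action={failure.get('action')} raised {failure.get('exception-type')}"
        )
    return lines


summary = summarize_chef_failures(json.loads(EVENT_JSON))
for line in summary:
    print(line)
```

For this event, the summary points at the execute resource "Setup of ephemeral drives" in the ephemeral_drives recipe of the aws-parallelcluster-environment cookbook, running on the g5.xlarge node in the gpu queue. The event does not include the output of the failed shell command, so the root cause cannot be read from it alone.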
