
Conversation


@MengjinYan MengjinYan commented Nov 21, 2025

Description

We recently realized that during actor restarts we generate duplicate task events for the same actor creation task ID at attempt 0. This happens because the actor restart process runs inside GCS, and the corresponding actor creation task TaskSpec is not updated with the new attempt number.

This PR adds logic to update the attempt number in the actor creation task TaskSpec and adds a test to verify the change.
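For context, a minimal sketch of how the duplicate attempt can be observed, modeled on the local test shown later in this thread; the use of the State API here and its field names are my assumption, not part of this PR:

```python
import os

import psutil

import ray
from ray.util.state import list_tasks

ray.init()


@ray.remote(max_restarts=-1, max_task_retries=-1)
class Actor:
    def get_pid(self):
        return os.getpid()


actor = Actor.remote()
pid = ray.get(actor.get_pid.remote())

# Kill the actor process so GCS restarts the actor, i.e. attempt 1 of the
# actor creation task.
psutil.Process(pid).kill()
ray.get(actor.get_pid.remote())  # Blocks until the restarted actor responds.

# Before this fix, events for the restarted actor creation task were still
# emitted with attempt 0, producing duplicate (task_id, attempt) pairs.
for task in list_tasks():
    print(task.task_id, task.name, task.attempt_number)
```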

Related issues

N/A

Additional information

N/A

@MengjinYan
Contributor Author

Another issue we found during the investigation: because part of the actor creation process happens in GCS, task events aren't sent for some of the task state transitions (SUBMITTED_TO_WORKER, and PENDING_NODE_ASSIGNMENT on restart). I'm planning a follow-up PR to expose the missing task events for the actor creation task.

@MengjinYan MengjinYan added the go (add ONLY when ready to merge, run all tests) label Nov 21, 2025
@MengjinYan MengjinYan marked this pull request as ready for review November 21, 2025 21:14
@MengjinYan MengjinYan requested a review from a team as a code owner November 21, 2025 21:14
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Nov 22, 2025
assert event["task_definition_event"]["task_attempt"] == 0
assert (
    event["task_definition_event"]["language"] == "PYTHON"
)
Contributor

Is it possible to just build the expected event and do event["task_definition_event"] == expected_event to make it a little cleaner?

Contributor Author

Not really, because event fields like event_id aren't deterministic.

One solution I can try is to extract all the deterministic fields and compare them against an expected_event built from those fields. Do you think that would be better?
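For illustration, a small sketch of that idea, comparing only the deterministic fields against an expected dict; the helper name and the sample payload are hypothetical:

```python
def assert_deterministic_fields(actual: dict, expected: dict) -> None:
    # Only the keys listed in `expected` are compared; non-deterministic
    # fields such as event_id or timestamps are simply not listed.
    for key, expected_value in expected.items():
        assert actual[key] == expected_value, (key, actual[key], expected_value)


# Hypothetical event payload with non-deterministic fields mixed in.
event = {
    "event_id": "not-deterministic-uuid",
    "task_definition_event": {
        "task_attempt": 0,
        "language": "PYTHON",
        "task_id": "also-not-deterministic",
    },
}

assert_deterministic_fields(
    event["task_definition_event"],
    {"task_attempt": 0, "language": "PYTHON"},
)
```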


# Add a second node to the cluster for the actor to be restarted on
cluster.add_node(num_cpus=2)
cluster.wait_for_nodes()
Contributor

add_node already defaults to wait: bool = True, so the explicit wait_for_nodes() here is redundant.
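A tiny sketch of the suggestion, assuming the Cluster test utility used by these tests:

```python
from ray.cluster_utils import Cluster

cluster = Cluster()
# add_node defaults to wait=True, so the node is already up when this call
# returns and a separate cluster.wait_for_nodes() is not needed.
cluster.add_node(num_cpus=2)
```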


driver_task_definition_received = False
actor_creation_task_definition_received = False
for event in events:
Contributor

is the order of events deterministic?

Member

Got it, thanks!

Contributor Author

Not here either. The order of events from the same worker process should be deterministic, but not across processes; events from different processes can interleave with each other.
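Given that, a sketch of an order-independent check in the spirit of the flags above; the event dicts here are simplified placeholders, not the real export format:

```python
# Simplified placeholder events; in the real test these come from the
# aggregator endpoint and may arrive interleaved across processes.
events = [
    {"task_definition_event": {"task_type": "ACTOR_CREATION_TASK"}},
    {"task_definition_event": {"task_type": "NORMAL_TASK"}},
]

driver_task_definition_received = False
actor_creation_task_definition_received = False
for event in events:
    task_type = event.get("task_definition_event", {}).get("task_type")
    if task_type == "NORMAL_TASK":
        driver_task_definition_received = True
    elif task_type == "ACTOR_CREATION_TASK":
        actor_creation_task_definition_received = True

# Only presence is asserted, so the interleaving order doesn't matter.
assert driver_task_definition_received
assert actor_creation_task_definition_received
```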

Comment on lines 1706 to 1721
script = """
import ray
import time
ray.init()
@ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
class Actor:
def __init__(self):
pass
def actor_task(self):
pass
actor = Actor.remote()
time.sleep(999) # Keep the actor alive
"""
Contributor

I like to use textwrap.dedent to make this readable.
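For example, a sketch of the textwrap.dedent pattern being suggested, with the script body abbreviated from the snippet above:

```python
import textwrap


def test_actor_restart_task_events():
    # dedent strips the common leading whitespace, so the driver script can be
    # indented with the test body while the executed string stays flush-left.
    script = textwrap.dedent(
        """
        import ray
        import time

        ray.init()

        @ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
        class Actor:
            def __init__(self):
                pass

        actor = Actor.remote()
        time.sleep(999)  # Keep the actor alive
        """
    )
    print(script)
```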

script, httpserver, ray_start_cluster_head_with_env_vars, validate_events
)

@_cluster_with_aggregator_target
Contributor

I'm looking at the fixture while reviewing this, and I think the env_var_prefix makes it hard to find the RAY_DASHBOARD_AGGREGATOR_AGENT_PRESERVE_PROTO_FIELD_NAME config. This is one of the few cases in which duplication is probably better.

Contributor Author

Good point. Discussed offline. We should fix that in a separate PR.

@MengjinYan
Contributor Author

While working on the test for the review comment about persisting the actor task spec, I realized there are two additional issues:

  1. After an actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.
  2. With GCS fault tolerance, the actor cannot be killed by ray.kill(actor) because the local Raylet address is not persisted/restored properly.

I'll investigate these issues and either add the fix to this PR (if it is required by the test case) or follow up with separate PRs.

Signed-off-by: Mengjin Yan <[email protected]>
@MengjinYan
Contributor Author


After investigating the two issues above:

  1. The task doesn't show up because the events for the actor creation task retry don't contain the TaskInfo, so the list_tasks API filters the entry out of the task table. The missing TaskInfo is mainly due to missing task events for the actor creation task scheduling process in GCS; the follow-up PR that adds more events will fix this.
  2. This issue can be worked around in the test by killing the actor process directly. We should further investigate and fix the underlying issue.

For this PR, I addressed the review comment by persisting the actor creation task spec after updating the attempt_number.

I also tested locally with the following test case:

import os

import psutil

import ray
import ray.util.state


def test_actor_creation_task_retry_with_gcs_restart(ray_start_regular_with_external_redis):
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class Actor:
        def __init__(self):
            pass

        def get_pid(self):
            return os.getpid()

    # Create an actor and make sure it is alive.
    actor = Actor.remote()
    pid_1 = ray.get(actor.get_pid.remote())

    # Kill the actor process and check that the actor creation task is retried.
    actor_process_1 = psutil.Process(pid_1)
    actor_process_1.kill()
    actor_process_1.wait(timeout=10)
    pid_2 = ray.get(actor.get_pid.remote())
    assert pid_2 != pid_1

    # Kill and restart the GCS.
    ray._private.worker._global_node.kill_gcs_server()
    ray._private.worker._global_node.start_gcs_server()

    # Check that the actor still works with the new GCS.
    pid_3 = ray.get(actor.get_pid.remote())
    assert pid_3 == pid_2

    # Kill the actor process again and check that the actor creation task
    # attempt number is now 2.
    # ray.kill(actor, no_restart=False)
    # wait_for_condition(lambda: list_actors()[0].state == "RESTARTING")
    actor_process_2 = psutil.Process(pid_2)
    actor_process_2.kill()
    actor_process_2.wait(timeout=10)
    pid_4 = ray.get(actor.get_pid.remote())
    assert pid_4 != pid_2

    tasks = ray.util.state.list_tasks()
    print(tasks)

This test is not added to the PR because the list_tasks issue mentioned above needs to be fixed first. But in the local run, I emitted logs of the raw task events and verified that, after the GCS restart, the actor restart has attempt_number == 2, which is one more than the attempt_number before the GCS restart. This means the attempt_number is persisted correctly.

[2025-11-24 12:39:53,998 I 98546 26382886] (gcs_server) gcs_task_manager.cc:479: [myan] Task event:
{"taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","attemptNumber":2,
 "taskInfo":{"type":"ACTOR_TASK","name":"Actor.get_pid","funcOrClassName":"Actor.get_pid","jobId":"AQAAAA==","taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","parentTaskId":"//////////////////////////8BAAAA","requiredResources":{"CPU":1},"runtimeEnvInfo":{"serializedRuntimeEnv":"{}","runtimeEnvConfig":{"setupTimeoutSeconds":600,"eagerInstall":true}},"actorId":"7/g9Nn2RZ6OIpX3eAQAAAA=="},
 "stateUpdates":{"nodeId":"+o2EmR4Mt6foTRq1pwJeYijQMhIvWLut2ebvEQ==","workerId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","workerPid":98607,"stateTsNs":{"1":"1764016791934715000","5":"1764016792935842000","11":"1764016793027231000","7":"1764016792937619000","2":"1764016792935806000","8":"1764016792937639000"}},
 "profileEvents":{"componentType":"worker","componentId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","nodeIpAddress":"127.0.0.1","events":[{"startTime":"1764016792938575000","endTime":"1764016792938579000","extraData":"{}","eventName":"task:deserialize_arguments"},{"startTime":"1764016792938593000","endTime":"1764016792938636000","extraData":"{}","eventName":"task:execute"},{"startTime":"1764016792938641000","endTime":"1764016793026486000","extraData":"{}","eventName":"task:store_outputs"},{"startTime":"1764016792937790000","endTime":"1764016793026920000","extraData":"{\"name\": \"get_pid\", \"task_id\": \"e0dc174c83599034eff83d367d9167a388a57dde01000000\"}","eventName":"task::Actor.get_pid"}]},
 "jobId":"AQAAAA=="}
