
Conversation


@MengjinYan MengjinYan commented Nov 21, 2025

Description

We recently realized that during actor restarts we generate duplicate task events for the same actor creation task ID at attempt 0. This happens because the actor restart process runs inside GCS, and the corresponding actor creation task TaskSpec is not updated with the new attempt number.

This PR adds logic to update the attempt number in the actor creation task TaskSpec and adds a test to verify the change.
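For context, a minimal sketch of how the duplicate attempt can be observed, modeled on the local test shown later in this thread; the use of the State API here and its field names are my assumption, not part of this PR:

```python
import os

import psutil

import ray
from ray.util.state import list_tasks

ray.init()


@ray.remote(max_restarts=-1, max_task_retries=-1)
class Actor:
    def get_pid(self):
        return os.getpid()


actor = Actor.remote()
pid = ray.get(actor.get_pid.remote())

# Kill the actor process so GCS restarts the actor, i.e. attempt 1 of the
# actor creation task.
psutil.Process(pid).kill()
ray.get(actor.get_pid.remote())  # Blocks until the restarted actor responds.

# Before this fix, events for the restarted actor creation task were still
# emitted with attempt 0, producing duplicate (task_id, attempt) pairs.
for task in list_tasks():
    print(task.task_id, task.name, task.attempt_number)
```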

Related issues

N/A

Additional information

N/A

@MengjinYan
Contributor Author

Another issue we found during the investigation: because part of the actor creation process happens in GCS, task events aren't sent for some of the task state transitions (SUBMITTED_TO_WORKER, and PENDING_NODE_ASSIGNMENT on restart). I'm planning a follow-up PR to expose the missing task events for the actor creation task.

@MengjinYan MengjinYan added the go (add ONLY when ready to merge, run all tests) label Nov 21, 2025
@MengjinYan MengjinYan marked this pull request as ready for review November 21, 2025 21:14
@MengjinYan MengjinYan requested a review from a team as a code owner November 21, 2025 21:14
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Nov 22, 2025
assert event["task_definition_event"]["task_attempt"] == 0
assert (
    event["task_definition_event"]["language"] == "PYTHON"
)
Contributor

Is it possible to just build the expected event and do event["task_definition_event"] == expected_event to make it a little cleaner?

Contributor Author

Not really, because event fields like event_id aren't deterministic.

One solution I can try is to extract all the deterministic fields and compare them against an expected_event built from those fields. Do you think that would be better?
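For illustration, a small sketch of that idea, comparing only the deterministic fields against an expected dict; the helper name and the sample payload are hypothetical:

```python
def assert_deterministic_fields(actual: dict, expected: dict) -> None:
    # Only the keys listed in `expected` are compared; non-deterministic
    # fields such as event_id or timestamps are simply not listed.
    for key, expected_value in expected.items():
        assert actual[key] == expected_value, (key, actual[key], expected_value)


# Hypothetical event payload with non-deterministic fields mixed in.
event = {
    "event_id": "not-deterministic-uuid",
    "task_definition_event": {
        "task_attempt": 0,
        "language": "PYTHON",
        "task_id": "also-not-deterministic",
    },
}

assert_deterministic_fields(
    event["task_definition_event"],
    {"task_attempt": 0, "language": "PYTHON"},
)
```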


# Add a second node to the cluster for the actor to be restarted on
cluster.add_node(num_cpus=2)
cluster.wait_for_nodes()
Contributor

add_node already defaults to wait: bool = True, so the explicit wait_for_nodes() here is redundant.
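A tiny sketch of the suggestion, assuming the Cluster test utility used by these tests:

```python
from ray.cluster_utils import Cluster

cluster = Cluster()
# add_node defaults to wait=True, so the node is already up when this call
# returns and a separate cluster.wait_for_nodes() is not needed.
cluster.add_node(num_cpus=2)
```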


driver_task_definition_received = False
actor_creation_task_definition_received = False
for event in events:
Contributor

is the order of events deterministic?

Member

Got it, thanks!

Contributor Author

Not here either. The order of events from the same worker process should be deterministic, but not across processes; events from different processes can interleave with each other.
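Given that, a sketch of an order-independent check in the spirit of the flags above; the event dicts here are simplified placeholders, not the real export format:

```python
# Simplified placeholder events; in the real test these come from the
# aggregator endpoint and may arrive interleaved across processes.
events = [
    {"task_definition_event": {"task_type": "ACTOR_CREATION_TASK"}},
    {"task_definition_event": {"task_type": "NORMAL_TASK"}},
]

driver_task_definition_received = False
actor_creation_task_definition_received = False
for event in events:
    task_type = event.get("task_definition_event", {}).get("task_type")
    if task_type == "NORMAL_TASK":
        driver_task_definition_received = True
    elif task_type == "ACTOR_CREATION_TASK":
        actor_creation_task_definition_received = True

# Only presence is asserted, so the interleaving order doesn't matter.
assert driver_task_definition_received
assert actor_creation_task_definition_received
```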

Comment on lines 1706 to 1721
script = """
import ray
import time
ray.init()
@ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
class Actor:
def __init__(self):
pass
def actor_task(self):
pass
actor = Actor.remote()
time.sleep(999) # Keep the actor alive
"""
Contributor

I like to use textwrap.dedent to make this readable.
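For example, a sketch of the textwrap.dedent pattern being suggested, with the script body abbreviated from the snippet above:

```python
import textwrap


def test_actor_restart_task_events():
    # dedent strips the common leading whitespace, so the driver script can be
    # indented with the test body while the executed string stays flush-left.
    script = textwrap.dedent(
        """
        import ray
        import time

        ray.init()

        @ray.remote(num_cpus=2, max_restarts=-1, max_task_retries=-1)
        class Actor:
            def __init__(self):
                pass

        actor = Actor.remote()
        time.sleep(999)  # Keep the actor alive
        """
    )
    print(script)
```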

script, httpserver, ray_start_cluster_head_with_env_vars, validate_events
)

@_cluster_with_aggregator_target
Contributor

I'm looking at the fixture while reviewing this, and I think the env_var_prefix makes it hard to find the RAY_DASHBOARD_AGGREGATOR_AGENT_PRESERVE_PROTO_FIELD_NAME config. This is one of the few cases in which duplication is probably better.

Contributor Author

Good point. Discussed offline. We should fix that in a separate PR.

@MengjinYan
Contributor Author

While working on the test for the review comment about persisting the actor task spec, I realized there are two additional issues:

  1. After an actor restarts, the retry of the actor creation task doesn't show up when listing the tasks.
  2. With GCS fault tolerance, the actor cannot be killed by ray.kill(actor) because the local Raylet address is not persisted/restored properly.

I'll investigate these issues and either add the fix to this PR (if it is required by the test case) or follow up with separate PRs.

Signed-off-by: Mengjin Yan <[email protected]>
@MengjinYan
Contributor Author


After investigating the two issues above:

  1. The task doesn't show up because the events for the actor creation task retry don't contain the TaskInfo, so the list_tasks API filters the entry out of the task table. The missing TaskInfo is mainly due to missing task events for the actor creation task scheduling process in GCS; the follow-up PR that adds more events will fix this.
  2. This issue can be worked around in the test by killing the actor process directly. We should further investigate and fix the underlying issue.

For this PR, I addressed the review comment by persisting the actor creation task spec after updating the attempt_number.

I also tested locally with the following test case:

import os

import psutil

import ray
import ray.util.state


def test_actor_creation_task_retry_with_gcs_restart(ray_start_regular_with_external_redis):
    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class Actor:
        def __init__(self):
            pass

        def get_pid(self):
            return os.getpid()

    # Create an actor and make sure it is alive.
    actor = Actor.remote()
    pid_1 = ray.get(actor.get_pid.remote())

    # Kill the actor process and check that the actor creation task is retried.
    actor_process_1 = psutil.Process(pid_1)
    actor_process_1.kill()
    actor_process_1.wait(timeout=10)
    pid_2 = ray.get(actor.get_pid.remote())
    assert pid_2 != pid_1

    # Kill and restart the GCS.
    ray._private.worker._global_node.kill_gcs_server()
    ray._private.worker._global_node.start_gcs_server()

    # Check that the actor still works with the new GCS.
    pid_3 = ray.get(actor.get_pid.remote())
    assert pid_3 == pid_2

    # Kill the actor process again and check that the actor creation task
    # attempt number is now 2.
    # ray.kill(actor, no_restart=False)
    # wait_for_condition(lambda: list_actors()[0].state == "RESTARTING")
    actor_process_2 = psutil.Process(pid_2)
    actor_process_2.kill()
    actor_process_2.wait(timeout=10)
    pid_4 = ray.get(actor.get_pid.remote())
    assert pid_4 != pid_2

    tasks = ray.util.state.list_tasks()
    print(tasks)

This test is not added to the PR because the list_tasks issue mentioned above needs to be fixed first. But in the local run, I emitted logs of the raw task events and verified that, after the GCS restart, the actor restart has attempt_number == 2, which is one more than the attempt_number before the GCS restart. This means the attempt_number is persisted correctly.

[2025-11-24 12:39:53,998 I 98546 26382886] (gcs_server) gcs_task_manager.cc:479: [myan] Task event:
{"taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","attemptNumber":2,
 "taskInfo":{"type":"ACTOR_TASK","name":"Actor.get_pid","funcOrClassName":"Actor.get_pid","jobId":"AQAAAA==","taskId":"4NwXTINZkDTv+D02fZFno4ilfd4BAAAA","parentTaskId":"//////////////////////////8BAAAA","requiredResources":{"CPU":1},"runtimeEnvInfo":{"serializedRuntimeEnv":"{}","runtimeEnvConfig":{"setupTimeoutSeconds":600,"eagerInstall":true}},"actorId":"7/g9Nn2RZ6OIpX3eAQAAAA=="},
 "stateUpdates":{"nodeId":"+o2EmR4Mt6foTRq1pwJeYijQMhIvWLut2ebvEQ==","workerId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","workerPid":98607,"stateTsNs":{"1":"1764016791934715000","5":"1764016792935842000","11":"1764016793027231000","7":"1764016792937619000","2":"1764016792935806000","8":"1764016792937639000"}},
 "profileEvents":{"componentType":"worker","componentId":"Sw5BFvkiYyOqJLWSsF2gLe5IvGd0zKyy0nsPrA==","nodeIpAddress":"127.0.0.1","events":[{"startTime":"1764016792938575000","endTime":"1764016792938579000","extraData":"{}","eventName":"task:deserialize_arguments"},{"startTime":"1764016792938593000","endTime":"1764016792938636000","extraData":"{}","eventName":"task:execute"},{"startTime":"1764016792938641000","endTime":"1764016793026486000","extraData":"{}","eventName":"task:store_outputs"},{"startTime":"1764016792937790000","endTime":"1764016793026920000","extraData":"{\"name\": \"get_pid\", \"task_id\": \"e0dc174c83599034eff83d367d9167a388a57dde01000000\"}","eventName":"task::Actor.get_pid"}]},
 "jobId":"AQAAAA=="}
