
[BUG]: Continuous "failed to join writer task: I/O error: Broken pipe" observed #4844

@zxue2

Describe the Bug

With an image built from the latest main branch, we see a continuous "failed to join writer task: I/O error: Broken pipe" error immediately after each "Health check successful for generate" message. Could this be related to the commit "fix: Fix warning messages with healthcheck (#4793)", or is there another possible cause?

I do not recall seeing this issue before 12/8.

2025-12-09T14:36:33.805975Z  INFO dynamo_runtime::health_check: Spawned health check task for endpoint: generate
2025-12-09T14:36:33.805985Z  INFO dynamo_runtime::health_check: Health check task started for: generate
2025-12-09T14:36:33.805992Z  INFO dynamo_runtime::component::endpoint: Registering endpoint with request plane server endpoint=generate transport="nats"
2025-12-09T14:36:33.806021Z  INFO dynamo_runtime::pipeline::network::ingress::nats_server: NatsMultiplexedServer::register_endpoint called endpoint_name=generate namespace=vllm-agg component=backend instance_id=7042392880458507523
2025-12-09T14:36:33.806033Z  INFO dynamo_runtime::pipeline::network::ingress::nats_server: Successfully retrieved service group
2025-12-09T14:36:33.806048Z  INFO dynamo_runtime::pipeline::network::ingress::nats_server: Registering NATS endpoint endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503 namespace=vllm-agg component=backend instance_id=7042392880458507523
2025-12-09T14:36:33.806053Z  INFO dynamo_runtime::pipeline::network::ingress::nats_server: Starting NATS push endpoint listener (blocking) endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503
2025-12-09T14:36:43.807605Z  INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:36:43.808823Z ERROR dynamo_runtime::health_check: Health check request failed for generate: instance_id=7042392880458507523 not found for endpoint vllm-agg/backend/generate
2025-12-09T14:36:46.430145Z ERROR http-request: tower_http::trace::on_failure: response failed classification=Status code: 503 Service Unavailable latency=0 ms method=GET uri=/live version=HTTP/1.1
2025-12-09T14:36:51.431332Z ERROR http-request: tower_http::trace::on_failure: response failed classification=Status code: 503 Service Unavailable latency=0 ms method=GET uri=/health version=HTTP/1.1
2025-12-09T14:36:51.693577Z  WARN logger._print_warning_once: cudagraph dispatching keys are not initialized. No cudagraph will be used.
2025-12-09T14:36:53.809261Z  INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:36:53.867328Z  INFO dynamo_runtime::health_check: Health check successful for generate
2025-12-09T14:36:53.867586Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)

Caused by:
    Broken pipe (os error 32)
2025-12-09T14:36:53.868410Z ERROR handle_payload: dynamo_runtime::pipeline::network::ingress::push_handler: Failed to publish response for stream 1b124693-2650-450f-9436-2bdb340d5646
2025-12-09T14:37:03.811901Z  INFO dynamo_runtime::health_check: Canary timer expired for generate, sending health check
2025-12-09T14:37:03.859990Z  INFO dynamo_runtime::health_check: Health check successful for generate
2025-12-09T14:37:03.860235Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)

Caused by:
    Broken pipe (os error 32)
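
To check on a longer log whether every broken-pipe error really follows a successful health check, the timestamps can be correlated with a small script. This is a minimal sketch that assumes log lines in the format shown above; the names `parse` and `broken_pipe_delays` are hypothetical helpers and not part of Dynamo.

```python
import re
from datetime import datetime

# Timestamp, level, and remainder of a log line in the format of the excerpt above.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z\s+(\w+)\s+(.*)$")


def parse(line):
    """Return (timestamp, level, message), or None for lines without a timestamp."""
    m = LINE_RE.match(line)
    if not m:
        return None  # e.g. "Caused by:" continuation lines
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return ts, m.group(2), m.group(3)


def broken_pipe_delays(lines):
    """Delay in ms from the latest health check success to each broken-pipe error."""
    delays = []
    last_ok = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if "Health check successful" in msg:
            last_ok = ts
        elif "failed to join writer task" in msg and last_ok is not None:
            delays.append((ts - last_ok).total_seconds() * 1000.0)
    return delays


# Four lines copied from the excerpt above.
SAMPLE = [
    "2025-12-09T14:36:53.867328Z  INFO dynamo_runtime::health_check: Health check successful for generate",
    "2025-12-09T14:36:53.867586Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)",
    "2025-12-09T14:37:03.859990Z  INFO dynamo_runtime::health_check: Health check successful for generate",
    "2025-12-09T14:37:03.860235Z ERROR dynamo_runtime::pipeline::network::tcp::client: failed to join writer task: I/O error: Broken pipe (os error 32)",
]

if __name__ == "__main__":
    for delay_ms in broken_pipe_delays(SAMPLE):
        print(f"broken pipe {delay_ms:.3f} ms after health check success")
```

On the two pairs in the excerpt, the error follows the success message by roughly 0.26 ms and 0.25 ms, which is consistent with the error being triggered by the health check response itself.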

Steps to Reproduce

  1. Build an image from main for an XPU device.
  2. The issue occurs on the decoder worker after it prints "Starting NATS push endpoint listener (blocking) endpoint_name=generate endpoint_with_id=generate-61bb9b0338782503".
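
The OS-level error in the log (Broken pipe, os error 32, which is EPIPE on Linux) can be produced in isolation, outside Dynamo. The following is a minimal sketch and not a reproduction of the Dynamo bug: it assumes a Linux host and uses a hypothetical helper, `write_after_peer_close`, to show that writing to a stream socket after the peer has closed raises this error. Whether the health check caller closing its response stream early is what happens here is an unverified assumption.

```python
import errno
import socket


def write_after_peer_close():
    """Write to a stream socket whose peer already closed; return the errno raised."""
    writer, reader = socket.socketpair()  # connected pair of Unix stream sockets
    try:
        reader.close()  # the receiving side goes away first
        try:
            writer.sendall(b"late response")
        except OSError as exc:
            # Python ignores SIGPIPE by default, so the failure surfaces as an exception.
            return exc.errno
        return None
    finally:
        writer.close()


if __name__ == "__main__":
    code = write_after_peer_close()
    print("errno:", code, "is EPIPE:", code == errno.EPIPE)
```

If the same pattern applies in Dynamo, the place to look would be whichever side closes the response connection right after the health check succeeds.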

Expected Behavior

The decoder worker runs normally.

Actual Behavior

The decoder worker keeps restarting continuously:

dynamo-system vllm-agg-vllmdecodeworker-7784b5dcbb-lpfl2 1/1 Running 7 (4m21s ago) 18m

Environment

main branch + Intel GPU

Additional Context

No response

Screenshots

No response
