[BUG]: ModelWatcher emits "Missing ModelEntry" errors when multiple namespaces share discovery stream #4801

@resouer

Describe the Bug

The ModelWatcher in lib/llm/src/discovery/watcher.rs correctly filters Added events by namespace, but does not apply the same filtering to Removed events. This causes spurious "Missing ModelEntry" errors when multiple Dynamo deployments (with different namespaces) share the same discovery backend.

Steps to Reproduce

  1. Deploy two separate Dynamo deployments in the same cluster with different namespaces:
    • Deployment A: namespace ns-a
    • Deployment B: namespace ns-b
  2. Both deployments share the same discovery backend (etcd/NATS)
  3. Add a model to Deployment B
  4. Remove/restart a worker in Deployment B
  5. Observe Deployment A's frontend logs

Expected Behavior

Deployment A's frontend should silently ignore Removed events for models from ns-b since it never added them to its registry (due to namespace filtering on Added events).

Actual Behavior

Deployment A's frontend logs errors:

ERROR dynamo_llm::discovery::watcher: error removing model error=Missing ModelEntry for models/cc9cc3b1-9e8d-4779-80a3-8a24812911d3
ERROR dynamo_llm::discovery::watcher: error removing model error=Missing ModelEntry for models/38ec0a68-2334-4c38-b682-b34c98b85f03

Environment

  • ai-dynamo Version: reproducible on both 0.5.1 and latest main branch (commit 046229f2f)
  • Deployment: Kubernetes with multiple DGDs
  • Discovery backend: Shared etcd/NATS across namespaces

Additional Context

Potential cause, from reading the Dynamo code:

In lib/llm/src/discovery/watcher.rs:

Added events (lines 146-157) - Has namespace filtering:

if !global_namespace
    && let Some(target_ns) = target_namespace
    && endpoint_id.namespace != target_ns
{
    tracing::debug!("Skipping model from different namespace");
    continue;  // ← Correctly filtered
}

Removed events (lines 203-220) - NO namespace filtering:

DiscoveryEvent::Removed(instance_id) => {
    let key = format!("{:x}", instance_id);
    // ← No namespace check, goes straight to handle_delete
    match self.handle_delete(&key, ...).await { ... }
}

The Removed event only contains instance_id (not namespace), so the watcher cannot filter it. When handle_delete is called, it fails because the model was never added to this watcher's registry.
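The asymmetry can be reproduced in miniature. The sketch below is not the actual watcher code: `DiscoveryEvent` here is a simplified stand-in, and `Watcher`, `registry`, and `handle` are hypothetical names chosen for illustration. It only assumes what is described above, namely that Added carries a namespace, Removed carries only an instance id, and the registry key is the hex-formatted id.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the discovery events (hypothetical shape).
enum DiscoveryEvent {
    Added { instance_id: u64, namespace: String },
    Removed(u64),
}

/// Hypothetical watcher that only tracks models from one namespace.
struct Watcher {
    target_namespace: String,
    /// Registry keyed by hex instance id; the value is the namespace.
    registry: HashMap<String, String>,
}

impl Watcher {
    fn new(target_namespace: &str) -> Self {
        Watcher {
            target_namespace: target_namespace.to_string(),
            registry: HashMap::new(),
        }
    }

    fn handle(&mut self, event: DiscoveryEvent) -> Result<(), String> {
        match event {
            DiscoveryEvent::Added { instance_id, namespace } => {
                // Namespace filter: foreign models never enter the registry.
                if namespace != self.target_namespace {
                    return Ok(());
                }
                self.registry.insert(format!("{:x}", instance_id), namespace);
                Ok(())
            }
            DiscoveryEvent::Removed(instance_id) => {
                // No namespace is available here, so no filter is possible.
                let key = format!("{:x}", instance_id);
                self.registry
                    .remove(&key)
                    .map(|_| ())
                    .ok_or_else(|| format!("Missing ModelEntry for {key}"))
            }
        }
    }
}
```

Feeding a watcher for ns-a an Added event from ns-b followed by the matching Removed event yields the error on the second call, because the first call was filtered out and left nothing to remove.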

One possible fix is to change handle_delete to handle missing keys gracefully:

let card = match self.manager.remove_model_card(key) {
    Some(card) => card,
    None => {
        // Model was from a different namespace, silently ignore
        return Ok(None);
    }
};

Alternatively, check whether the key exists before calling handle_delete.
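As a rough, self-contained illustration of the graceful-handling option, here is a sketch with the real types replaced by hypothetical stand-ins. `Manager`, `ModelCard`, and the `Result<Option<String>, String>` return type are assumptions made for this example; only the names `handle_delete` and `remove_model_card` come from the snippet above, and their real signatures may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a model deployment card.
#[derive(Debug, Clone, PartialEq)]
struct ModelCard {
    name: String,
}

/// Hypothetical stand-in for the manager that owns registered cards.
#[derive(Default)]
struct Manager {
    cards: HashMap<String, ModelCard>,
}

impl Manager {
    /// Removes and returns the card for `key`, if one was registered.
    fn remove_model_card(&mut self, key: &str) -> Option<ModelCard> {
        self.cards.remove(key)
    }
}

/// Returns the removed model's name, or `Ok(None)` when the key is unknown
/// (for example, because the model belonged to another namespace and was
/// filtered out when it was added).
fn handle_delete(manager: &mut Manager, key: &str) -> Result<Option<String>, String> {
    let card = match manager.remove_model_card(key) {
        Some(card) => card,
        // Unknown key: treat as a no-op instead of an error.
        None => return Ok(None),
    };
    Ok(Some(card.name))
}
```

One trade-off worth noting: treating every unknown key as a no-op also hides cases where an entry is missing for some other reason, so logging the skipped key at debug level may be preferable to ignoring it completely.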
