Description
Describe the Bug
The ModelWatcher in lib/llm/src/discovery/watcher.rs correctly filters Added events by namespace, but does not apply the same filtering to Removed events. This causes spurious "Missing ModelDeploymentCard" errors when multiple Dynamo deployments (with different namespaces) share the same discovery backend.
Steps to Reproduce
- Deploy two separate Dynamo deployments in the same cluster with different namespaces:
  - Deployment A: namespace ns-a
  - Deployment B: namespace ns-b
- Both deployments share the same discovery backend (etcd/NATS)
- Add a model to Deployment B
- Remove/restart a worker in Deployment B
- Observe Deployment A's frontend logs
Expected Behavior
Deployment A's frontend should silently ignore Removed events for models from ns-b since it never added them to its registry (due to namespace filtering on Added events).
Actual Behavior
Deployment A's frontend logs errors:
ERROR dynamo_llm::discovery::watcher: error removing model error=Missing ModelEntry for models/cc9cc3b1-9e8d-4779-80a3-8a24812911d3
ERROR dynamo_llm::discovery::watcher: error removing model error=Missing ModelEntry for models/38ec0a68-2334-4c38-b682-b34c98b85f03
Environment
- ai-dynamo Version: reproducible on both 0.5.1 and latest main branch (commit 046229f2f)
- Deployment: Kubernetes with multiple DGD
- Discovery backend: Shared etcd/NATS across namespaces
Additional Context
Potential issue by eyeballing dynamo code:
In lib/llm/src/discovery/watcher.rs:
Added events (lines 146-157) - Has namespace filtering:
if !global_namespace
    && let Some(target_ns) = target_namespace
    && endpoint_id.namespace != target_ns
{
    tracing::debug!("Skipping model from different namespace");
    continue; // ← Correctly filtered
}
Removed events (lines 203-220) - NO namespace filtering:
DiscoveryEvent::Removed(instance_id) => {
    let key = format!("{:x}", instance_id);
    // ← No namespace check, goes straight to handle_delete
    match self.handle_delete(&key, ...).await { ... }
}
The Removed event only contains instance_id (not namespace), so the watcher cannot filter it. When handle_delete is called, it fails because the model was never added to this watcher's registry.
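The asymmetry can be reproduced in isolation with a toy model. The sketch below is not the Dynamo code: `ToyWatcher`, its `registry` map, and the simplified `DiscoveryEvent` (which carries the namespace directly on `Added`) are hypothetical stand-ins, chosen only to show how filtering on add but not on remove produces the "missing entry" error.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for the real discovery event type.
#[derive(Debug)]
enum DiscoveryEvent {
    Added { instance_id: u64, namespace: String },
    // Like the real Removed event described above, this carries only the id.
    Removed(u64),
}

// Hypothetical stand-in for the watcher plus its model registry.
struct ToyWatcher {
    target_namespace: String,
    // Keyed by the hex-formatted instance id, mirroring format!("{:x}", instance_id).
    registry: HashMap<String, String>,
}

impl ToyWatcher {
    fn new(target_namespace: &str) -> Self {
        Self {
            target_namespace: target_namespace.to_string(),
            registry: HashMap::new(),
        }
    }

    // Returns Err when a removal targets a key that was never registered.
    fn handle(&mut self, event: DiscoveryEvent) -> Result<(), String> {
        match event {
            DiscoveryEvent::Added { instance_id, namespace } => {
                if namespace != self.target_namespace {
                    // Filtered: models from other namespaces never enter the registry.
                    return Ok(());
                }
                self.registry.insert(format!("{:x}", instance_id), namespace);
                Ok(())
            }
            DiscoveryEvent::Removed(instance_id) => {
                // No namespace is available here, so no filtering happens.
                let key = format!("{:x}", instance_id);
                self.registry
                    .remove(&key)
                    .map(|_| ())
                    .ok_or_else(|| format!("Missing entry for {key}"))
            }
        }
    }
}

fn main() {
    let mut watcher = ToyWatcher::new("ns-a");
    // A model from ns-b is added: the ns-a watcher skips it.
    watcher
        .handle(DiscoveryEvent::Added { instance_id: 0x2a, namespace: "ns-b".to_string() })
        .unwrap();
    // The matching removal still reaches the ns-a watcher and fails.
    let result = watcher.handle(DiscoveryEvent::Removed(0x2a));
    println!("{result:?}"); // prints Err("Missing entry for 2a")
}
```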
So maybe just change handle_delete to gracefully handle missing keys:
let card = match self.manager.remove_model_card(key) {
    Some(card) => card,
    None => {
        // Model was from a different namespace, silently ignore
        return Ok(None);
    }
};
Or check if key exists before calling handle_delete.
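A self-contained sketch of the first suggestion, under stated assumptions: `ToyManager`, `ModelCard`, `save_model_card`, and the free-function form and return type of `handle_delete` are all hypothetical and only approximate the shape of the real code. Only `remove_model_card` returning an `Option` is taken from the snippet above.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real model card type.
#[derive(Debug, Clone, PartialEq)]
struct ModelCard {
    name: String,
}

// Hypothetical stand-in for the manager that owns registered cards.
#[derive(Default)]
struct ToyManager {
    cards: HashMap<String, ModelCard>,
}

impl ToyManager {
    // Hypothetical helper used here only to populate the registry.
    fn save_model_card(&mut self, key: &str, card: ModelCard) {
        self.cards.insert(key.to_string(), card);
    }

    // Returns the removed card, or None if the key was never registered.
    fn remove_model_card(&mut self, key: &str) -> Option<ModelCard> {
        self.cards.remove(key)
    }
}

// Ok(Some(name)) when a card was removed, Ok(None) when the key is unknown.
// An unknown key is treated as "not ours" instead of an error.
fn handle_delete(manager: &mut ToyManager, key: &str) -> Result<Option<String>, String> {
    let card = match manager.remove_model_card(key) {
        Some(card) => card,
        // Likely a model from a different namespace that was filtered on add.
        None => return Ok(None),
    };
    Ok(Some(card.name))
}

fn main() {
    let mut manager = ToyManager::default();
    manager.save_model_card("2a", ModelCard { name: "model-a".to_string() });

    // A known key is removed and its name is reported.
    assert_eq!(handle_delete(&mut manager, "2a"), Ok(Some("model-a".to_string())));
    // An unknown key (e.g. from another namespace) is ignored without error.
    assert_eq!(handle_delete(&mut manager, "ff"), Ok(None));
}
```

One trade-off with this approach: a silent `Ok(None)` also hides a genuinely missing entry from the watcher's own namespace, so logging the miss at debug level may be a reasonable middle ground.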