Description
What happened?
When a new ClusterBaseModel is created, the model-agent downloads the same model twice simultaneously. Two gopher tasks (Download and DownloadOverride) are processed within 200 ms of each other, and both workers start downloading before either registers in the activeDownloads map. This race condition results in a concatenated, malformed config.json model file, among other issues.
Evidence from logs:
2025-10-30T14:35:22.511Z INFO gopher.go:928 Downloading HuggingFace model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8
2025-10-30T14:35:22.690Z INFO gopher.go:243 Processing gopher task: type: DownloadOverride
2025-10-30T14:35:22.700Z INFO gopher.go:928 Downloading HuggingFace model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8
Both downloads complete, duplicating all 24 safetensor shards.
What did you expect to happen?
Only one download should occur. The second task should detect the in-progress download and skip.
How can we reproduce it (as minimally and precisely as possible)?
- Deploy OME with --num-download-worker=2 or more
- Create a new ClusterBaseModel:
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
name: qwen3-vl-235b-a22b-thinking-fp8
spec:
displayName: qwen.qwen3-vl-235b-a22b-thinking-fp8
vendor: qwen
disabled: false
version: "1.0.0"
storage:
storageUri: hf://Qwen/Qwen3-VL-235B-A22B-Thinking-FP8
path: /mnt/data/models/qwen/qwen3-vl-235b-a22b-thinking-fp8
- Check model-agent logs: two concurrent downloads start ~200 ms apart
Anything else we need to know?
Root cause: Race condition in pkg/modelagent/gopher.go:276-279. The check and registration of activeDownloads[modelUID] are not atomic:
// Worker 1 and 2 both reach here before either registers
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloadsMutex.Lock()
s.activeDownloads[modelUID] = cancel // No check before registration
s.activeDownloadsMutex.Unlock()

Proposed fix: Atomic check-and-set inside the mutex:
s.activeDownloadsMutex.Lock()
if _, isDownloading := s.activeDownloads[modelUID]; isDownloading {
s.activeDownloadsMutex.Unlock()
s.logger.Infof("Model %s already downloading, skipping", modelInfo)
return nil
}
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloads[modelUID] = cancel
s.activeDownloadsMutex.Unlock()

Workaround: Patch the deployment to set --num-download-worker=1, which serializes downloads. Currently the Helm daemonset.yaml value is hardcoded to 2.
Environment
- OME version: v0.1.3
- Kubernetes version (use kubectl version): v1.33.1
- Cloud provider or hardware configuration: Nebul
- OS (e.g., from /etc/os-release): n/a
- Runtime (SGLang, vLLM, etc.) and version: n/a
- Model being served (if applicable): n/a
- Install method (Helm, kubectl, etc.): Helm