
[BUG] Duplicate model downloads when creating new ClusterBaseModel #308

@WillemMomma

Description

What happened?

When a new ClusterBaseModel is created, the model-agent downloads the same model twice simultaneously. Two gopher tasks (Download and DownloadOverride) are processed within ~200ms of each other, and both workers start downloading before either registers in the activeDownloads map. This race condition corrupts the downloaded artifacts, most visibly a concatenated, malformed config.json, among other issues.

Evidence from logs:

2025-10-30T14:35:22.511Z  INFO  gopher.go:928  Downloading HuggingFace model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8
2025-10-30T14:35:22.690Z  INFO  gopher.go:243  Processing gopher task: type: DownloadOverride
2025-10-30T14:35:22.700Z  INFO  gopher.go:928  Downloading HuggingFace model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8

Both downloads complete, duplicating all 24 safetensor shards.

What did you expect to happen?

Only one download should occur. The second task should detect the in-progress download and skip.

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy OME with --num-download-worker=2 or more
  2. Create a new ClusterBaseModel:
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: qwen3-vl-235b-a22b-thinking-fp8
spec:
  displayName: qwen.qwen3-vl-235b-a22b-thinking-fp8
  vendor: qwen
  disabled: false
  version: "1.0.0"
  storage:
    storageUri: hf://Qwen/Qwen3-VL-235B-A22B-Thinking-FP8
    path: /mnt/data/models/qwen/qwen3-vl-235b-a22b-thinking-fp8
  3. Check model-agent logs - two concurrent downloads start ~200ms apart

Anything else we need to know?

Root cause: Race condition in pkg/modelagent/gopher.go:276-279. The check and registration of activeDownloads[modelUID] are not atomic:

// Worker 1 and 2 both reach here before either registers
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloadsMutex.Lock()
s.activeDownloads[modelUID] = cancel  // No check before registration
s.activeDownloadsMutex.Unlock()
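
For illustration, here is a minimal, self-contained Go sketch of the check-then-act race (names are hypothetical, not the actual gopher.go code): the in-progress check is taken under the mutex, but registration happens later, so two workers can both pass the check before either one registers.

// Minimal sketch of the race; hypothetical names, not the real model-agent code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type agent struct {
	mu              sync.Mutex
	activeDownloads map[string]struct{}
}

func (a *agent) download(worker int, modelUID string) {
	// Check happens here...
	a.mu.Lock()
	_, busy := a.activeDownloads[modelUID]
	a.mu.Unlock()

	if busy {
		fmt.Printf("worker %d: %s already downloading, skipping\n", worker, modelUID)
		return
	}

	// ...but registration happens later, so a second worker can slip in
	// between the check and this Lock (the ~200ms window seen in the logs).
	time.Sleep(10 * time.Millisecond)

	a.mu.Lock()
	a.activeDownloads[modelUID] = struct{}{} // register without re-checking
	a.mu.Unlock()

	fmt.Printf("worker %d: downloading %s\n", worker, modelUID)
}

func main() {
	a := &agent{activeDownloads: map[string]struct{}{}}
	var wg sync.WaitGroup
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			a.download(w, "Qwen/Qwen3-VL-235B-A22B-Thinking-FP8")
		}(i)
	}
	wg.Wait()
	// Typical output: both workers print "downloading", i.e. a duplicate download.
}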

Proposed fix: Atomic check-and-set inside the mutex:

s.activeDownloadsMutex.Lock()
if _, isDownloading := s.activeDownloads[modelUID]; isDownloading {
    s.activeDownloadsMutex.Unlock()
    s.logger.Infof("Model %s already downloading, skipping", modelInfo)
    return nil
}
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloads[modelUID] = cancel
s.activeDownloadsMutex.Unlock()
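
A hypothetical regression-test sketch (illustrative names, not from the repo) that would guard against this: with check and registration done under a single lock, only one of N concurrent callers should win the right to download a given modelUID.

// Hypothetical test; mirrors the proposed fix with a fake gopher type.
package modelagent

import (
	"sync"
	"sync/atomic"
	"testing"
)

type fakeGopher struct {
	mu              sync.Mutex
	activeDownloads map[string]func()
}

// tryStartDownload performs the atomic check-and-set from the proposed fix.
func (g *fakeGopher) tryStartDownload(modelUID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, isDownloading := g.activeDownloads[modelUID]; isDownloading {
		return false
	}
	g.activeDownloads[modelUID] = func() {} // stand-in for the cancel func
	return true
}

func TestSingleDownloadPerModel(t *testing.T) {
	g := &fakeGopher{activeDownloads: map[string]func(){}}
	var started int64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if g.tryStartDownload("qwen3-vl-uid") {
				atomic.AddInt64(&started, 1)
			}
		}()
	}
	wg.Wait()
	if started != 1 {
		t.Fatalf("expected exactly 1 download to start, got %d", started)
	}
}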

Workaround: Set --num-download-worker=1 (e.g. by patching the model-agent DaemonSet) to serialize downloads. The Helm chart's daemonset.yaml currently hardcodes this value to 2.

Environment

  • OME version: v0.1.3
  • Kubernetes version (use kubectl version): v1.33.1
  • Cloud provider or hardware configuration: Nebul
  • OS (e.g., from /etc/os-release): n/a
  • Runtime (SGLang, vLLM, etc.) and version: n/a
  • Model being served (if applicable): n/a
  • Install method (Helm, kubectl, etc.): Helm
