A lightweight machine learning training job simulator designed for testing observability systems, OpenTelemetry sidecars, and metrics collection pipelines.
This simulator generates realistic ML training metrics (loss, accuracy, GPU utilization, etc.) and writes them to a shared volume in JSON format. It's ideal for:
- Testing OpenTelemetry sidecar implementations
- Validating metrics collection and export pipelines
- Developing ML observability tools without running actual training jobs
- Load testing monitoring systems with realistic ML workload patterns
```bash
# Build the image
docker build -t mock-ml-job .

# Run with default configuration
docker run --rm mock-ml-job

# Run with custom configuration and volume mount
docker run --rm \
  -e TOTAL_EPOCHS=5 \
  -e MODEL_NAME=bert-base \
  -e WRITE_INTERVAL=5 \
  -v $(pwd)/metrics:/shared/metrics \
  mock-ml-job
```
```bash
# Pull the latest image
docker pull ghcr.io/openteams-ai/mock-ml-job:latest

# Run it
docker run --rm ghcr.io/openteams-ai/mock-ml-job:latest
```
```bash
# Install dependencies
pip install -r requirements.txt

# Run the simulator
python training_simulator.py
```
Configure the simulator using environment variables; a sketch of how they might be read follows the table:
Variable | Description | Default |
---|---|---|
`JOB_ID` | Unique identifier for this training job | `training-job-001` |
`MODEL_NAME` | Name of the model being trained | `resnet-50` |
`DATASET` | Dataset name | `imagenet` |
`METRICS_FILE_PATH` | Path to write metrics JSON | `/shared/metrics/current.json` |
`WRITE_INTERVAL` | Seconds between metric updates | `10` |
`TOTAL_EPOCHS` | Number of training epochs to simulate | `10` |
`BATCHES_PER_EPOCH` | Batches per epoch | `100` |
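The table maps directly onto the simulator's configuration. As a rough illustration (not a copy of `training_simulator.py`), reading these variables with their documented defaults could look like:

```python
import os

# Illustrative sketch only: the documented variables and their defaults.
JOB_ID = os.environ.get("JOB_ID", "training-job-001")
MODEL_NAME = os.environ.get("MODEL_NAME", "resnet-50")
DATASET = os.environ.get("DATASET", "imagenet")
METRICS_FILE_PATH = os.environ.get("METRICS_FILE_PATH", "/shared/metrics/current.json")
WRITE_INTERVAL = int(os.environ.get("WRITE_INTERVAL", "10"))
TOTAL_EPOCHS = int(os.environ.get("TOTAL_EPOCHS", "10"))
BATCHES_PER_EPOCH = int(os.environ.get("BATCHES_PER_EPOCH", "100"))
```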
The simulator produces realistic training metrics with natural progression and variance; a rough sketch of how such curves could be generated follows the list:
- `training_loss`: Decreases from ~2.5 to ~0.15 over the course of training
- `validation_loss`: Decreases more slowly and stays slightly higher than the training loss
- `accuracy`: Increases from ~15% to ~94%
- `learning_rate`: Exponential decay (0.95^epoch)
- `gpu_utilization`: Stable around 92% with realistic variance
- `processing_time_ms`: ~245ms per batch with variance
- `samples_per_second`: Calculated throughput metric
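The exact formulas live in `training_simulator.py`; the sketch below only illustrates how curves of this shape could be produced (the `0.95^epoch` decay comes from the list above; the function name and the remaining constants are placeholders):

```python
import random


def sketch_metrics(epoch: int, total_epochs: int, base_lr: float = 0.001) -> dict:
    """Illustrative curves only -- not the simulator's actual implementation."""
    progress = epoch / max(total_epochs - 1, 1)  # 0.0 at the start, 1.0 at the final epoch
    return {
        "training_loss": 2.5 - (2.5 - 0.15) * progress + random.uniform(-0.05, 0.05),
        "validation_loss": 2.5 - (2.5 - 0.25) * progress + random.uniform(-0.07, 0.07),
        "accuracy": 0.15 + (0.94 - 0.15) * progress + random.uniform(-0.01, 0.01),
        "learning_rate": base_lr * (0.95 ** epoch),  # exponential decay
        "gpu_utilization": 0.92 + random.uniform(-0.02, 0.02),
        "processing_time_ms": 245 + random.uniform(-15, 15),
    }
```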
The metrics are written to a JSON file with the following structure:
```json
{
  "timestamp": "2025-10-23T10:30:45.123456+00:00",
  "job_metadata": {
    "job_id": "training-job-001",
    "model_name": "resnet-50",
    "dataset": "imagenet",
    "start_time": "2025-10-23T10:15:00.000000+00:00"
  },
  "training_metrics": {
    "epoch": 3,
    "batch_number": 45,
    "training_loss": 1.2345,
    "validation_loss": 1.3456,
    "accuracy": 0.6789,
    "learning_rate": 0.000857375,
    "gpu_utilization": 0.923,
    "processing_time_ms": 238,
    "samples_per_second": 134.45
  }
}
```
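From the sidecar's point of view, consuming this file is just periodic read-and-parse. A minimal consumer sketch (the path and interval match the defaults above; the `print` stands in for whatever export logic your pipeline uses):

```python
import json
import time
from pathlib import Path

METRICS_PATH = Path("/shared/metrics/current.json")  # METRICS_FILE_PATH default


def poll_metrics(interval_s: float = 10.0) -> None:
    """Hypothetical sidecar loop: read the latest snapshot and hand it off."""
    while True:
        if METRICS_PATH.exists():
            snapshot = json.loads(METRICS_PATH.read_text())
            job = snapshot["job_metadata"]
            metrics = snapshot["training_metrics"]
            # Replace this print with your OTLP/Prometheus export logic.
            print(job["job_id"], metrics["epoch"], metrics["training_loss"])
        time.sleep(interval_s)
```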
The simulator is designed to run as the main container in a pod with an observability sidecar:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-with-sidecar
spec:
  containers:
    - name: training-simulator
      image: ghcr.io/openteams-ai/mock-ml-job:latest
      env:
        - name: TOTAL_EPOCHS
          value: "20"
        - name: WRITE_INTERVAL
          value: "5"
      volumeMounts:
        - name: metrics
          mountPath: /shared/metrics
    - name: metrics-exporter-sidecar
      image: your-sidecar-image:latest
      volumeMounts:
        - name: metrics
          mountPath: /shared/metrics
          readOnly: true
  volumes:
    - name: metrics
      emptyDir: {}
```
The simulator uses an atomic write pattern (write to a `.tmp` file, then rename) to ensure sidecars never read partial or corrupted JSON data.
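The simulator's own code isn't reproduced here, but the pattern described is the standard write-then-rename idiom, roughly:

```python
import json
import os


def write_atomically(path: str, payload: dict) -> None:
    """Sketch of the write-then-rename pattern: readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())    # ensure bytes are on disk before the rename
    os.replace(tmp_path, path)  # atomic on POSIX: readers see the old or new file, never a mix
```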
```bash
# Fast iteration testing (short epochs, quick updates)
docker run --rm \
  -e TOTAL_EPOCHS=3 \
  -e BATCHES_PER_EPOCH=20 \
  -e WRITE_INTERVAL=2 \
  mock-ml-job

# Long-running training simulation
docker run --rm \
  -e TOTAL_EPOCHS=100 \
  -e BATCHES_PER_EPOCH=500 \
  -e WRITE_INTERVAL=30 \
  mock-ml-job

# Custom model and dataset metadata
docker run --rm \
  -e MODEL_NAME=transformer-xl \
  -e DATASET=wikitext-103 \
  -e JOB_ID=exp-2025-001 \
  mock-ml-job
```
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -r requirements-dev.txt

# Or use make
make install-dev
```
The project includes a Makefile for common development tasks:
```bash
# Show all available commands
make help

# Run all quality checks (as CI does)
make ci

# Individual commands
make lint             # Run linting
make typecheck        # Run type checking
make test-unit        # Run unit tests
make docker-build     # Build Docker image
make test-integration # Run integration tests
make clean            # Remove generated files
```
```bash
# Run unit tests with coverage
pytest test_training_simulator.py

# Run tests with verbose output
pytest test_training_simulator.py -v

# Run specific test
pytest test_training_simulator.py::TestTrainingSimulator::test_calculate_metrics_structure

# Generate coverage report
pytest test_training_simulator.py --cov=training_simulator --cov-report=html
# Open htmlcov/index.html in browser
```
Integration tests build and run the actual Docker container to verify it behaves correctly:
```bash
# Build the Docker image first
docker build -t mock-ml-job:test .

# Run integration tests (requires Docker)
pytest test_integration.py -v

# Run all tests (unit + integration)
pytest -v
```
Note: Integration tests require Docker to be installed and running.
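The shipped tests live in `test_integration.py`; as a hypothetical illustration of the approach (run the image, mount a temporary directory, assert that valid JSON appears), such a test might look like:

```python
import json
import subprocess
import time
from pathlib import Path


def test_container_writes_metrics(tmp_path: Path) -> None:
    """Hypothetical example test -- not copied from test_integration.py."""
    subprocess.run(
        ["docker", "run", "-d", "--name", "mock-ml-job-it",
         "-e", "WRITE_INTERVAL=1",
         "-v", f"{tmp_path}:/shared/metrics",
         "mock-ml-job:test"],
        check=True,
    )
    try:
        time.sleep(5)  # allow a few write cycles
        snapshot = json.loads((tmp_path / "current.json").read_text())
        assert "training_metrics" in snapshot
    finally:
        subprocess.run(["docker", "rm", "-f", "mock-ml-job-it"], check=False)
```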
```bash
# Run linting
ruff check .

# Auto-fix linting issues
ruff check --fix .

# Run type checking
mypy training_simulator.py test_training_simulator.py

# Run all checks (like CI does)
ruff check . && mypy training_simulator.py test_training_simulator.py && pytest
```
```
mock-ml-job/
├── training_simulator.py        # Main simulator code
├── test_training_simulator.py   # Unit tests (100% coverage)
├── test_integration.py          # Integration tests (Docker container)
├── requirements.txt             # Runtime dependencies
├── requirements-dev.txt         # Development dependencies
├── pyproject.toml               # Tool configuration (pytest, mypy, ruff)
├── Dockerfile                   # Container definition
└── .github/workflows/
    └── ci.yml                   # CI/CD: tests, linting, type-checking, Docker build & push
```
The project uses GitHub Actions with a 4-stage pipeline:
Stage 1: Test (runs on every commit and PR)
- Linting with ruff
- Type checking with mypy
- Unit tests with pytest on Python 3.9, 3.10, 3.11, 3.12
- Coverage reporting (100% required)
Stage 2: Build (after tests pass)
- Builds Docker image (linux/amd64)
- Saves image as artifact for next stage
Stage 3: Integration Test (using built image)
- Loads Docker image from previous stage
- Runs 8 integration tests validating container behavior
- Tests metrics generation, configuration, signal handling, etc.
Stage 4: Push (only on tagged releases like `v1.0.0`)
- Builds a multi-architecture image (linux/amd64, linux/arm64)
- Pushes to GitHub Container Registry
- Supports both x86 and ARM64 (Apple Silicon M-series Macs)
- Tags: `latest`, `v1.0.0`, `v1.0`, `v1`
This pipeline ensures that every commit is fully validated and that only tagged releases publish Docker images.
See CLAUDE.md for additional development guidance.
Apache License 2.0 - See LICENSE file for details.