A lightweight machine learning training job simulator designed for testing observability systems, OpenTelemetry sidecars, and metrics collection pipelines.
This simulator generates realistic ML training metrics (loss, accuracy, GPU utilization, etc.) and writes them to a shared volume in JSON format. It's ideal for:
- Testing OpenTelemetry sidecar implementations
- Validating metrics collection and export pipelines
- Developing ML observability tools without running actual training jobs
- Load testing monitoring systems with realistic ML workload patterns
```bash
# Build the image
docker build -t mock-ml-job .

# Run with default configuration
docker run --rm mock-ml-job

# Run with custom configuration and volume mount
docker run --rm \
  -e TOTAL_EPOCHS=5 \
  -e MODEL_NAME=bert-base \
  -e WRITE_INTERVAL=5 \
  -v $(pwd)/metrics:/shared/metrics \
  mock-ml-job
```
```bash
# Pull the latest image
docker pull ghcr.io/openteams-ai/mock-ml-job:latest

# Run it
docker run --rm ghcr.io/openteams-ai/mock-ml-job:latest
```
```bash
# Install dependencies
pip install -r requirements.txt

# Run the simulator
python training_simulator.py
```
Configure the simulator using environment variables; a sketch of how they might be read follows the table:
Variable | Description | Default |
---|---|---|
`JOB_ID` | Unique identifier for this training job | `training-job-001` |
`MODEL_NAME` | Name of the model being trained | `resnet-50` |
`DATASET` | Dataset name | `imagenet` |
`METRICS_FILE_PATH` | Path to write metrics JSON | `/shared/metrics/current.json` |
`WRITE_INTERVAL` | Seconds between metric updates | `10` |
`TOTAL_EPOCHS` | Number of training epochs to simulate | `10` |
`BATCHES_PER_EPOCH` | Batches per epoch | `100` |
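The table maps directly onto the simulator's configuration. As a rough illustration (not a copy of `training_simulator.py`), reading these variables with their documented defaults could look like:

```python
import os

# Illustrative sketch only: the documented variables and their defaults.
JOB_ID = os.environ.get("JOB_ID", "training-job-001")
MODEL_NAME = os.environ.get("MODEL_NAME", "resnet-50")
DATASET = os.environ.get("DATASET", "imagenet")
METRICS_FILE_PATH = os.environ.get("METRICS_FILE_PATH", "/shared/metrics/current.json")
WRITE_INTERVAL = int(os.environ.get("WRITE_INTERVAL", "10"))
TOTAL_EPOCHS = int(os.environ.get("TOTAL_EPOCHS", "10"))
BATCHES_PER_EPOCH = int(os.environ.get("BATCHES_PER_EPOCH", "100"))
```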
The simulator produces realistic training metrics with natural progression and variance; a rough sketch of how such curves could be generated follows the list:
- `training_loss`: Decreases from ~2.5 to ~0.15 over the course of training
- `validation_loss`: Decreases more slowly and stays slightly higher than the training loss
- `accuracy`: Increases from ~15% to ~94%
- `learning_rate`: Exponential decay (0.95^epoch)
- `gpu_utilization`: Stable around 92% with realistic variance
- `processing_time_ms`: ~245ms per batch with variance
- `samples_per_second`: Calculated throughput metric
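The exact formulas live in `training_simulator.py`; the sketch below only illustrates how curves of this shape could be produced (the `0.95^epoch` decay comes from the list above; the function name and the remaining constants are placeholders):

```python
import random


def sketch_metrics(epoch: int, total_epochs: int, base_lr: float = 0.001) -> dict:
    """Illustrative curves only -- not the simulator's actual implementation."""
    progress = epoch / max(total_epochs - 1, 1)  # 0.0 at the start, 1.0 at the final epoch
    return {
        "training_loss": 2.5 - (2.5 - 0.15) * progress + random.uniform(-0.05, 0.05),
        "validation_loss": 2.5 - (2.5 - 0.25) * progress + random.uniform(-0.07, 0.07),
        "accuracy": 0.15 + (0.94 - 0.15) * progress + random.uniform(-0.01, 0.01),
        "learning_rate": base_lr * (0.95 ** epoch),  # exponential decay
        "gpu_utilization": 0.92 + random.uniform(-0.02, 0.02),
        "processing_time_ms": 245 + random.uniform(-15, 15),
    }
```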
The metrics are written to a JSON file with the following structure:
```json
{
  "timestamp": "2025-10-23T10:30:45.123456+00:00",
  "job_metadata": {
    "job_id": "training-job-001",
    "model_name": "resnet-50",
    "dataset": "imagenet",
    "start_time": "2025-10-23T10:15:00.000000+00:00"
  },
  "training_metrics": {
    "epoch": 3,
    "batch_number": 45,
    "training_loss": 1.2345,
    "validation_loss": 1.3456,
    "accuracy": 0.6789,
    "learning_rate": 0.000857375,
    "gpu_utilization": 0.923,
    "processing_time_ms": 238,
    "samples_per_second": 134.45
  }
}
```
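From the sidecar's point of view, consuming this file is just periodic read-and-parse. A minimal consumer sketch (the path and interval match the defaults above; the `print` stands in for whatever export logic your pipeline uses):

```python
import json
import time
from pathlib import Path

METRICS_PATH = Path("/shared/metrics/current.json")  # METRICS_FILE_PATH default


def poll_metrics(interval_s: float = 10.0) -> None:
    """Hypothetical sidecar loop: read the latest snapshot and hand it off."""
    while True:
        if METRICS_PATH.exists():
            snapshot = json.loads(METRICS_PATH.read_text())
            job = snapshot["job_metadata"]
            metrics = snapshot["training_metrics"]
            # Replace this print with your OTLP/Prometheus export logic.
            print(job["job_id"], metrics["epoch"], metrics["training_loss"])
        time.sleep(interval_s)
```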
The simulator is designed to run as the main container in a pod with an observability sidecar:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-with-sidecar
spec:
  containers:
    - name: training-simulator
      image: ghcr.io/openteams-ai/mock-ml-job:latest
      env:
        - name: TOTAL_EPOCHS
          value: "20"
        - name: WRITE_INTERVAL
          value: "5"
      volumeMounts:
        - name: metrics
          mountPath: /shared/metrics
    - name: metrics-exporter-sidecar
      image: your-sidecar-image:latest
      volumeMounts:
        - name: metrics
          mountPath: /shared/metrics
          readOnly: true
  volumes:
    - name: metrics
      emptyDir: {}
```
The simulator uses an atomic write pattern (write to a `.tmp` file, then rename) to ensure sidecars never read partial or corrupted JSON data.
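The simulator's own code isn't reproduced here, but the pattern described is the standard write-then-rename idiom, roughly:

```python
import json
import os


def write_atomically(path: str, payload: dict) -> None:
    """Sketch of the write-then-rename pattern: readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())    # ensure bytes are on disk before the rename
    os.replace(tmp_path, path)  # atomic on POSIX: readers see the old or new file, never a mix
```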
```bash
# Fast iteration testing (short epochs, quick updates)
docker run --rm \
  -e TOTAL_EPOCHS=3 \
  -e BATCHES_PER_EPOCH=20 \
  -e WRITE_INTERVAL=2 \
  mock-ml-job

# Long-running training simulation
docker run --rm \
  -e TOTAL_EPOCHS=100 \
  -e BATCHES_PER_EPOCH=500 \
  -e WRITE_INTERVAL=30 \
  mock-ml-job

# Custom model and dataset metadata
docker run --rm \
  -e MODEL_NAME=transformer-xl \
  -e DATASET=wikitext-103 \
  -e JOB_ID=exp-2025-001 \
  mock-ml-job
```
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -r requirements-dev.txt

# Or use make
make install-dev
```
The project includes a Makefile for common development tasks:
```bash
# Show all available commands
make help

# Run all quality checks (as CI does)
make ci

# Individual commands
make lint             # Run linting
make typecheck        # Run type checking
make test-unit        # Run unit tests
make docker-build     # Build Docker image
make test-integration # Run integration tests
make clean            # Remove generated files
```
```bash
# Run unit tests with coverage
pytest test_training_simulator.py

# Run tests with verbose output
pytest test_training_simulator.py -v

# Run specific test
pytest test_training_simulator.py::TestTrainingSimulator::test_calculate_metrics_structure

# Generate coverage report
pytest test_training_simulator.py --cov=training_simulator --cov-report=html
# Open htmlcov/index.html in browser
```
Integration tests build and run the actual Docker container to verify it behaves correctly:
```bash
# Build the Docker image first
docker build -t mock-ml-job:test .

# Run integration tests (requires Docker)
pytest test_integration.py -v

# Run all tests (unit + integration)
pytest -v
```
Note: Integration tests require Docker to be installed and running.
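The shipped tests live in `test_integration.py`; as a hypothetical illustration of the approach (run the image, mount a temporary directory, assert that valid JSON appears), such a test might look like:

```python
import json
import subprocess
import time
from pathlib import Path


def test_container_writes_metrics(tmp_path: Path) -> None:
    """Hypothetical example test -- not copied from test_integration.py."""
    subprocess.run(
        ["docker", "run", "-d", "--name", "mock-ml-job-it",
         "-e", "WRITE_INTERVAL=1",
         "-v", f"{tmp_path}:/shared/metrics",
         "mock-ml-job:test"],
        check=True,
    )
    try:
        time.sleep(5)  # allow a few write cycles
        snapshot = json.loads((tmp_path / "current.json").read_text())
        assert "training_metrics" in snapshot
    finally:
        subprocess.run(["docker", "rm", "-f", "mock-ml-job-it"], check=False)
```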
```bash
# Run linting
ruff check .

# Auto-fix linting issues
ruff check --fix .

# Run type checking
mypy training_simulator.py test_training_simulator.py

# Run all checks (like CI does)
ruff check . && mypy training_simulator.py test_training_simulator.py && pytest
```
```
mock-ml-job/
├── training_simulator.py        # Main simulator code
├── test_training_simulator.py   # Unit tests (100% coverage)
├── test_integration.py          # Integration tests (Docker container)
├── requirements.txt             # Runtime dependencies
├── requirements-dev.txt         # Development dependencies
├── pyproject.toml               # Tool configuration (pytest, mypy, ruff)
├── Dockerfile                   # Container definition
└── .github/workflows/
    └── ci.yml                   # CI/CD: tests, linting, type-checking, Docker build & push
```
The project uses GitHub Actions with a 4-stage pipeline:
Stage 1: Test (runs on every commit and PR)
- Linting with ruff
- Type checking with mypy
- Unit tests with pytest on Python 3.9, 3.10, 3.11, 3.12
- Coverage reporting (100% required)
Stage 2: Build (after tests pass)
- Builds Docker image (linux/amd64)
- Saves image as artifact for next stage
Stage 3: Integration Test (using built image)
- Loads Docker image from previous stage
- Runs 8 integration tests validating container behavior
- Tests metrics generation, configuration, signal handling, etc.
Stage 4: Push (only on tagged releases like `v1.0.0`)
- Builds a multi-architecture image (linux/amd64, linux/arm64)
- Pushes to GitHub Container Registry
- Supports both x86 and ARM64 (Apple Silicon M-series Macs)
- Tags: `latest`, `v1.0.0`, `v1.0`, `v1`
This pipeline ensures that every commit is fully validated and that only tagged releases publish Docker images.
See CLAUDE.md for additional development guidance.
Apache License 2.0 - See LICENSE file for details.