Commit cad53ea

improving simulator
1 parent 276267a commit cad53ea

File tree

24 files changed: +2899 −140 lines


tools/simulator/AGENTS.md

Lines changed: 33 additions & 0 deletions
# Repository Guidelines

## Project Structure & Module Organization

- `core/` holds the scheduling engines and memory planners; start with `core/global_engine.py` or `core/node_global_engine.py` when altering execution flow.
- `cli/` provides runnable entry points such as `run_simulator.py` (the simulation driver) and `plot_roofline.py` (a visualization helper).
- `api/` and `utils/` supply service adapters and shared helpers (trace loading, hardware math, serializers); prefer adding reusable logic there instead of duplicating it.
- `internal/` packages the analyzer toolkit plus canonical hardware configs; treat these files as reference data.
- `examples/` stores sample traces and environment JSON/JSONL fixtures for smoke tests; generated outputs belong in `.local/` and stay untracked.

## Build, Test, and Development Commands

- `python -m venv .venv && source .venv/bin/activate` to isolate dependencies.
- `python -m pip install -r requirements.txt` installs runtime packages (`humanize`, `transformers`).
- `python cli/run_simulator.py --input examples/trace.jsonl --n-engines 2 --arrival-rate 1.5 --trace-output .local/trace.json --stats-output .local/stats.json` runs the canonical workload and surfaces performance statistics.
- `python cli/plot_roofline.py --input .local/stats.json --out plots/roofline.png` turns stats into an image; create the destination directory first.
## Coding Style & Naming Conventions

- Target Python 3.10+, four-space indentation, and PEP 8 naming (snake_case for modules and functions, upper-case names for enums such as `REQ_STATUS`).
- Use type hints and concise docstrings, as in `core/request.py:GenerationRequest`, to clarify intent.
- Group imports stdlib → third-party → local, and expose public symbols explicitly in `__init__.py` files when it improves discoverability.

## Testing Guidelines

- No automated suite exists yet; replay `cli/run_simulator.py` with `examples/` fixtures and inspect `.local/stats.json` for regressions after each change.
- New tests should rely on `pytest` under `tests/`, mirroring the module structure (e.g., `tests/core/test_memory_planner.py`) with descriptive names like `test_allocates_kv_cache`.
- Capture before/after throughput or latency figures when altering performance-sensitive code, and share them in the review thread.
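A test following these conventions could look like the sketch below. The `MemoryPlanner` class here is a self-contained stand-in used only to make the example runnable; the repository's actual planner (`NodeMemoryPlanner` in `core/node.py`) has its own API.

```python
# Hypothetical layout for tests/core/test_memory_planner.py.


class MemoryPlanner:
    """Stand-in for the simulator's memory planner (not the real class)."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity_gb = capacity_gb
        self.allocated_gb = 0.0

    def allocate_kv_cache(self, size_gb: float) -> bool:
        # Reject allocations that would exceed the GPU's capacity.
        if self.allocated_gb + size_gb > self.capacity_gb:
            return False
        self.allocated_gb += size_gb
        return True


def test_allocates_kv_cache():
    planner = MemoryPlanner(capacity_gb=80.0)
    assert planner.allocate_kv_cache(16.0)
    assert planner.allocated_gb == 16.0


def test_rejects_oversized_allocation():
    planner = MemoryPlanner(capacity_gb=80.0)
    assert not planner.allocate_kv_cache(100.0)
```

The test functions use bare `assert` statements, so `pytest tests/` discovers and runs them without extra plumbing.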
## Commit & Pull Request Guidelines

- Follow the history style: imperative subjects (`Add README for LLM Simulator`) and optional issue references in parentheses (e.g., `(#60)`).
- Keep commits focused and avoid checking in artifacts from `.local/` or large trace files.
- Pull requests need a short motivation, the commands you ran (build/test), and links or screenshots for visualization changes.

## Security & Configuration Tips

- Review `internal/configs/hardware_params.py` and `examples/env.json` before adding hardware profiles; never commit production-specific credentials.
- Treat environment-change JSONL fixtures as append-only: add new files for new scenarios instead of rewriting shared samples.

tools/simulator/CLAUDE.md

Lines changed: 66 additions & 15 deletions
````diff
@@ -9,18 +9,36 @@ This is an LLM inference simulator that models the performance of Large Language
 ## Key Commands
 
 ### Running Simulations
+
+#### Node-based Architecture (Recommended)
 ```bash
-# Run the main simulator with default parameters
-python cli/start_simulator.py --input <trace_file> --n-engines <num_engines> --arrival-rate <rate>
+# Run with node-based environment configuration
+python cli/run_simulator.py --input <trace_file> --environment examples/env.json --arrival-rate <rate>
+
+# Example with node-based configuration
+python cli/run_simulator.py --input examples/trace.jsonl --environment examples/env.json --arrival-rate 1.0
 
-# Example with typical parameters
-python cli/start_simulator.py --input trace.json --n-engines 4 --arrival-rate 1.0
+# Example with environment changes (dynamic GPU provisioning)
+python cli/run_simulator.py --input examples/trace.jsonl --environment examples/env.json --environment-change-file examples/env_changes.jsonl --arrival-rate 1.0
 
-# Output files are generated in .local/replay_results/ by default:
-# - trace.json: Chrome trace format events for visualization
-# - stats.json: Request statistics and performance metrics
+# Limit number of requests for testing
+python cli/run_simulator.py --input examples/trace.jsonl --environment examples/env.json --arrival-rate 1.0 --limit 100
 ```
 
+#### Legacy Engine-based Architecture
+```bash
+# Run with legacy engine configuration (backward compatibility)
+python cli/run_simulator.py --input <trace_file> --n-engines <num_engines> --arrival-rate <rate>
+
+# Example with legacy configuration
+python cli/run_simulator.py --input trace.json --n-engines 4 --arrival-rate 1.0
+```
+
+#### Output Files
+Output files are generated in `.local/replay_results/` by default:
+- `trace.json`: Chrome trace format events for visualization
+- `stats.json`: Request statistics and performance metrics
+
 ### Roofline Analysis
 ```bash
 # Generate roofline plots for different hardware
````
````diff
@@ -40,15 +58,28 @@ pip install -r requirements.txt
 
 ### Core Components
 
-1. **LLMGlobalEngine** (`core/global_engine.py`): Central orchestrator that manages multiple LLM engines, handles request scheduling, and tracks global simulation state.
+#### Node-based Architecture (New)
+1. **NodeGlobalEngine** (`core/node_global_engine.py`): Central orchestrator that manages multiple compute nodes, handles request scheduling, and supports dynamic node re-provisioning.
+
+2. **ComputeNode** (`core/node.py`): Represents a physical server with multiple GPUs, managing resource allocation, model loading, and request scheduling across the GPUs in the node.
+
+3. **NodeMemoryPlanner** (`core/node.py`): Manages memory allocation across multiple GPUs in a node, considering both GPU memory and node-level constraints.
 
-2. **LLMEngine** (`core/engine.py`): Individual inference engine that processes requests through prefill and decode phases, manages memory allocation, and generates trace events.
+4. **Node Routing Policies** (`core/policies/routing/node_based.py`): Determines how requests are assigned to nodes. Includes random, least-loaded, round-robin, and best-fit policies.
 
-3. **GenerationRequest** (`core/request.py`): Represents a single inference request with metadata like input/output lengths, arrival time, and current status.
+5. **Node Re-provisioning Policies** (`core/policies/node_reprovisioning/`): Handles dynamic re-provisioning of nodes between different models when no suitable nodes are available.
 
-4. **ModelAnalyzer** (`internal/analyzer/model_analyzer.py`): Performs roofline analysis to estimate inference times based on hardware parameters and model configurations.
+#### Legacy Engine-based Architecture
+6. **LLMGlobalEngine** (`core/global_engine.py`): Central orchestrator that manages multiple LLM engines, handles request scheduling, and tracks global simulation state.
 
-5. **Routing Policies** (`core/policies/`): Determines how requests are assigned to engines. Currently implements random routing, with extensible base class for other policies.
+7. **LLMEngine** (`core/engine.py`): Individual inference engine that processes requests through prefill and decode phases, manages memory allocation, and generates trace events.
+
+#### Shared Components
+8. **GenerationRequest** (`core/request.py`): Represents a single inference request with metadata like input/output lengths, arrival time, and current status.
+
+9. **ModelAnalyzer** (`internal/analyzer/model_analyzer.py`): Performs roofline analysis to estimate inference times based on hardware parameters and model configurations.
+
+10. **Environment Configuration** (`core/env.py`): Supports both node-based and legacy GPU-based environment configurations with infrastructure constraints.
 
 ### Key Data Flow
````
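The node routing policies added in this commit (random, least-loaded, round-robin, best-fit) share a single selection interface. The sketch below illustrates the general pattern only; the class and method names are assumptions, not the actual definitions in `core/policies/routing/node_based.py`.

```python
import random
from abc import ABC, abstractmethod


class Node:
    """Minimal stand-in for ComputeNode: tracks queued work only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.queued_requests = 0


class RoutingPolicy(ABC):
    """Base class: pick the node that should receive the next request."""

    @abstractmethod
    def select(self, nodes: list[Node]) -> Node: ...


class RandomRouting(RoutingPolicy):
    def select(self, nodes: list[Node]) -> Node:
        return random.choice(nodes)


class LeastLoadedRouting(RoutingPolicy):
    def select(self, nodes: list[Node]) -> Node:
        # Route to the node with the fewest queued requests.
        return min(nodes, key=lambda n: n.queued_requests)


class RoundRobinRouting(RoutingPolicy):
    def __init__(self) -> None:
        self._next = 0

    def select(self, nodes: list[Node]) -> Node:
        node = nodes[self._next % len(nodes)]
        self._next += 1
        return node
```

A best-fit policy would follow the same shape, scoring nodes by how tightly a request's memory demand fits their free capacity.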

````diff
@@ -107,9 +138,29 @@ The roofline analysis calculates:
 
 ## Testing and Validation
 
+### Example Files
+The simulator includes example configuration files in the `examples/` directory:
+- `env.json`: Node-based environment configuration with A100 and H100 clusters
+- `trace.jsonl`: Sample request trace for testing
+- `env_changes.jsonl`: Example dynamic environment changes for testing re-provisioning
+
+### Example Commands
+```bash
+# Test with node-based configuration (10 requests)
+python cli/run_simulator.py --input examples/trace.jsonl --environment examples/env.json --arrival-rate 1.0 --limit 10
+
+# Test with dynamic environment changes
+python cli/run_simulator.py --input examples/trace.jsonl --environment examples/env.json --environment-change-file examples/env_changes.jsonl --arrival-rate 0.5
+
+# Test legacy engine mode
+python cli/run_simulator.py --input examples/trace.jsonl --n-engines 2 --arrival-rate 1.0 --limit 10
+```
+
+### Output Analysis
 The simulator outputs:
-- Chrome trace format files for performance visualization
+- Chrome trace format files for performance visualization (load into `chrome://tracing`)
 - JSON statistics with latency, throughput, and queue metrics
-- SLO pass rates for multi-stage request processing
+- Node-level utilization and re-provisioning events
+- Model loading and unloading trace events
 
 Typical validation involves comparing simulated latencies against real hardware measurements for known models and hardware configurations.
````
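The environment fixtures referenced above are not reproduced in this diff. Judging from the CLI help strings (`--environment`: nodes, GPUs, bandwidth; `--environment-change-file`: timestamp, gpu_name, amount), they might look roughly like the sketch below; the field names are assumptions, not the schema defined in `core/env.py`.

```json
{
  "nodes": [
    {"name": "a100-node-0", "gpu_name": "nvidia_A100", "gpu_count": 8},
    {"name": "h100-node-0", "gpu_name": "nvidia_H100", "gpu_count": 8}
  ]
}
```

Each line of a hypothetical `env_changes.jsonl` would then add or remove GPU capacity at a given simulation time:

```json
{"timestamp": 60.0, "gpu_name": "nvidia_H100", "amount": 4}
{"timestamp": 300.0, "gpu_name": "nvidia_A100", "amount": -2}
```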

tools/simulator/README.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -2,6 +2,10 @@ The LLM Simulator is a comprehensive performance modeling and simulation tool de
 
 The simulator consists of several key components that work together to model the complete lifecycle of LLM inference requests:
 
+## Contributor Guide
+
+See [`AGENTS.md`](AGENTS.md) for repository guidelines covering project layout, development workflows, testing expectations, and pull request conventions.
+
 - **Request Modeling**: Simulates incoming generation requests with configurable arrival patterns
 - **Engine Simulation**: Models LLM inference engines with prefill and decode phases
 - **Performance Analysis**: Provides roofline analysis and hardware-specific performance metrics
```

tools/simulator/__init__.py

Whitespace-only changes.
Lines changed: 137 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,137 @@
1+
import os
2+
import json
3+
from dataclasses import asdict
4+
from rich.console import Console
5+
from core.global_engine import LLMGlobalEngine
6+
from core.node_global_engine import NodeGlobalEngine
7+
from utils.loader import load_trace
8+
from core.env import load_environment_config, load_environment_changes
9+
10+
console = Console()
11+
12+
13+
def run_simulation(args):
14+
print(args)
15+
workload = load_trace(
16+
args.input,
17+
float(args.arrival_rate),
18+
)
19+
if args.limit > 0:
20+
workload = workload[: args.limit]
21+
22+
# Load environment configuration
23+
environment_config = None
24+
if hasattr(args, "environment") and args.environment:
25+
environment_config = load_environment_config(args.environment)
26+
27+
# Load environment changes if provided
28+
environment_changes = None
29+
if hasattr(args, "environment_change_file") and args.environment_change_file:
30+
environment_changes = load_environment_changes(args.environment_change_file)
31+
32+
# Choose engine type based on whether environment config is provided
33+
if environment_config:
34+
# Use NodeGlobalEngine when environment config is provided
35+
print("Using Node-based Global Engine")
36+
server = NodeGlobalEngine(
37+
environment_config=environment_config,
38+
environment_changes=environment_changes,
39+
print_interval=args.print_interval,
40+
)
41+
else:
42+
# Fallback to legacy LLMGlobalEngine for backward compatibility
43+
print("Using Legacy Engine-based Global Engine")
44+
server = LLMGlobalEngine(
45+
environment_config=environment_config,
46+
environment_changes=environment_changes,
47+
print_interval=args.print_interval,
48+
)
49+
50+
# If no environment config is provided, use the old method
51+
for _ in range(args.n_engines):
52+
server.add_engine(
53+
"meta-llama/Meta-Llama-3-70B-Instruct", "nvidia_A100", 4, 4, 4
54+
)
55+
56+
server.load_requests(workload)
57+
print(f"--" * 10 + " Simulation Started " + "--" * 10)
58+
server.start()
59+
60+
# Collect stats (works for both legacy and node-based engines)
61+
if hasattr(server, "requests_stats"):
62+
summary = server.requests_stats
63+
else:
64+
summary = []
65+
66+
if hasattr(server, "failed_requests"):
67+
failed = server.failed_requests
68+
else:
69+
failed = []
70+
71+
if hasattr(server, "config"):
72+
config = server.config
73+
else:
74+
config = {"engine_type": "node_based" if environment_config else "legacy"}
75+
76+
stats = {
77+
"summary": summary,
78+
"failed": failed,
79+
"config": config,
80+
}
81+
os.makedirs(os.path.dirname(args.trace_output), exist_ok=True)
82+
os.makedirs(os.path.dirname(args.stats_output), exist_ok=True)
83+
with open(args.trace_output, "w") as f:
84+
data = {"traceEvents": [asdict(x) for x in server.trace]}
85+
f.write(json.dumps(data, indent=4))
86+
with open(args.stats_output, "w") as f:
87+
f.write(json.dumps(stats, indent=4))
88+
89+
print(end="\n")
90+
print(f"--" * 10 + " Simulation Done " + "--" * 10)
91+
92+
93+
if __name__ == "__main__":
94+
import argparse
95+
96+
parser = argparse.ArgumentParser()
97+
parser.add_argument("--input", type=str, help="Input file")
98+
parser.add_argument("--n-engines", type=int, help="Number of engines")
99+
parser.add_argument("--arrival-rate", help="Arrival rate", default=None)
100+
parser.add_argument(
101+
"--trace-output",
102+
type=str,
103+
help="Trace file",
104+
default=".local/replay_results/trace.json",
105+
)
106+
parser.add_argument(
107+
"--stats-output",
108+
type=str,
109+
help="Stats file",
110+
default=".local/replay_results/stats.json",
111+
)
112+
parser.add_argument(
113+
"--limit",
114+
type=int,
115+
help="Limit the number of requests",
116+
default=-1,
117+
)
118+
parser.add_argument(
119+
"--environment",
120+
type=str,
121+
help="JSON file containing initial environment configuration (nodes, GPUs, bandwidth, etc.)",
122+
default=None,
123+
)
124+
parser.add_argument(
125+
"--environment-change-file",
126+
type=str,
127+
help="JSONL file containing dynamic environment changes (timestamp, gpu_name, amount)",
128+
default=None,
129+
)
130+
parser.add_argument(
131+
"--print-interval",
132+
type=float,
133+
help="Print interval for progress updates in seconds (default: 0.1)",
134+
default=0.1,
135+
)
136+
args = parser.parse_args()
137+
run_simulation(args)
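The trace file written above uses the Chrome trace event format, viewable at `chrome://tracing`. A minimal, self-contained sketch of producing a compatible file follows; the `TraceEvent` dataclass here lists the standard Chrome keys and is not the simulator's own trace dataclass.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TraceEvent:
    """Standard Chrome trace-event keys (the simulator's dataclass may differ)."""

    name: str   # label shown on the timeline slice
    cat: str    # category used for filtering
    ph: str     # phase: "X" = complete event with a duration
    ts: float   # start time in microseconds
    dur: float  # duration in microseconds
    pid: int    # process row group, e.g. one per node or engine
    tid: int    # thread row, e.g. one per GPU or request lane


events = [
    TraceEvent("prefill", "request", "X", ts=0.0, dur=1_500.0, pid=0, tid=0),
    TraceEvent("decode", "request", "X", ts=1_500.0, dur=8_000.0, pid=0, tid=0),
]
# Same envelope the simulator writes: {"traceEvents": [...]}
payload = {"traceEvents": [asdict(e) for e in events]}
print(json.dumps(payload, indent=4)[:80])
```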

tools/simulator/cli/start_simulator.py

Lines changed: 0 additions & 70 deletions
This file was deleted.
