[None][doc] Move AutoDeploy README.md to torch docs (#6528)

Fridah-nv · suyoggupta · web-flow · commit cc0f4c87d43c · 2025-08-08T19:11:45.000-04:00
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
diff --git a/docs/source/media/ad_overview.png b/docs/source/media/ad_overview.png
diff --git a/docs/source/torch.md b/docs/source/torch.md
@@ -38,3 +38,7 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
 ## Known Issues
 
 - The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.
+
+## Prototype Features
+
+- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)
diff --git a/docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md b/docs/source/torch/auto_deploy/advanced/benchmarking_with_trtllm_bench.md
@@ -0,0 +1,93 @@
+# Benchmarking with trtllm-bench
+
+AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehensive performance metrics such as token throughput, request throughput, and latency for your AutoDeploy-optimized models.
+
+## Getting Started
+
+Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.
+
+## Basic Usage
+
+Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:
+
+```bash
+trtllm-bench \
+  --model meta-llama/Llama-3.1-8B \
+  throughput \
+  --dataset /tmp/synthetic_128_128.txt \
+  --backend _autodeploy
+```
+
+```{note}
+As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
+```
+
+## Advanced Configuration
+
+For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:
+
+```bash
+trtllm-bench \
+  --model meta-llama/Llama-3.1-8B \
+  throughput \
+  --dataset /tmp/synthetic_128_128.txt \
+  --backend _autodeploy \
+  --extra_llm_api_options autodeploy_config.yaml
+```
+
+### Configuration Examples
+
+#### Basic Performance Configuration (`autodeploy_config.yaml`)
+
+```yaml
+# Compilation backend
+compile_backend: torch-opt
+
+# Runtime engine
+runtime: trtllm
+
+# Model loading
+skip_loading_weights: false
+
+# Fraction of free memory to use for kv-caches
+free_mem_ratio: 0.8
+
+# CUDA Graph optimization
+cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
+
+# Attention backend
+attn_backend: flashinfer
+
+# Sequence configuration
+max_batch_size: 256
+```
+
+Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.
+
+## Configuration Options Reference
+
+### Core Performance Settings
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `compile_backend` | `torch-compile` | Compilation backend: `torch-simple`, `torch-compile`, `torch-cudagraph`, `torch-opt` |
+| `runtime` | `trtllm` | Runtime engine: `trtllm`, `demollm` |
+| `free_mem_ratio` | `0.0` | Fraction of available GPU memory for KV cache (0.0-1.0) |
+| `skip_loading_weights` | `false` | Skip weight loading for architecture-only benchmarks |
+
+### CUDA Graph Optimization
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |
+
+```{tip}
+For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
+```
+
+## Performance Optimization Tips
+
+1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization.
+1. **Compilation Backend**: Use `torch-opt` for production workloads.
+1. **Attention Backend**: `flashinfer` provides the best performance for most models.
+1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
diff --git a/docs/source/torch/auto_deploy/advanced/example_run.md b/docs/source/torch/auto_deploy/advanced/example_run.md
@@ -0,0 +1,49 @@
+# Example Run Script
+
+To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:
+
+```bash
+cd examples/auto_deploy
+python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
+```
+
+You can configure your experiment with various options. Use the `-h/--help` flag to see available options:
+
+```bash
+python build_and_run_ad.py --help
+```
+
+The following is a non-exhaustive list of common configuration options:
+
+| Configuration Key | Description |
+|-------------------|-------------|
+| `--model` | The HF model card or path to a HF checkpoint folder |
+| `--args.model-factory` | Choose model factory implementation (`"AutoModelForCausalLM"`, ...) |
+| `--args.skip-loading-weights` | Only load the architecture, not the weights |
+| `--args.model-kwargs` | Extra kwargs that are being passed to the model initializer in the model factory |
+| `--args.tokenizer-kwargs` | Extra kwargs that are being passed to the tokenizer initializer in the model factory |
+| `--args.world-size` | The number of GPUs used for auto-sharding the model |
+| `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
+| `--args.compile-backend` | Specifies how to compile the graph at the end |
+| `--args.attn-backend` | Specifies kernel implementation for attention |
+| `--args.mla-backend` | Specifies implementation for multi-head latent attention |
+| `--args.max-seq-len` | Maximum sequence length for inference/cache |
+| `--args.max-batch-size` | Maximum dimension for statically allocated KV cache |
+| `--args.attn-page-size` | Page size for attention |
+| `--prompt.batch-size` | Number of queries to generate |
+| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |
+
+For default values and additional configuration options, refer to the `ExperimentConfig` class in the `examples/auto_deploy/build_and_run_ad.py` file.
+
+The following is a more complete example of using the script:
+
+```bash
+cd examples/auto_deploy
+python build_and_run_ad.py \
+--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+--args.world-size 2 \
+--args.runtime "demollm" \
+--args.compile-backend "torch-compile" \
+--args.attn-backend "flashinfer" \
+--benchmark.enabled True
+```
diff --git a/docs/source/torch/auto_deploy/advanced/expert_configurations.md b/docs/source/torch/auto_deploy/advanced/expert_configurations.md
@@ -0,0 +1,178 @@
+# Expert Configuration of LLM API
+
+For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TRT-LLM argument list.
+
+- All configuration fields used by the AutoDeploy core pipeline, `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
+  Refer to those fields first.
+- For advanced users, the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args` can be used to configure the AutoDeploy `LLM` API, including runtime options.
+- Note that some fields in the full `LlmArgs`
+  object overlap with, duplicate, or are _ignored_ by AutoDeploy, particularly arguments
+  that configure the model itself, because AutoDeploy's model ingestion and optimization pipeline
+  differs significantly from the default manual workflow in TensorRT-LLM.
+- With proper care, however, the full `LlmArgs`
+  object can be used to configure advanced runtime options in TensorRT-LLM.
+- Any valid field can be provided as a keyword argument (`**kwargs`) to the AutoDeploy `LLM` API.
+
+# Expert Configuration of `build_and_run_ad.py`
+
+`build_and_run_ad.py` offers advanced configuration through a flexible argument parser built on Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and rely on the configuration precedence rules to compose complex deployment configurations.
+
+## CLI Arguments with Dot Notation
+
+The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig` or `LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:
+
+```bash
+# Configure model parameters
+# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
+# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
+# specified as CLI arg
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --args.model-kwargs.num-hidden-layers=10 \
+  --args.model-kwargs.hidden-size=2048 \
+  --args.tokenizer-kwargs.padding-side=left
+
+# Configure runtime and backend options
+python build_and_run_ad.py \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --args.world-size=2 \
+  --args.compile-backend=torch-opt \
+  --args.attn-backend=flashinfer
+
+# Configure prompting and benchmarking
+python build_and_run_ad.py \
+  --model "microsoft/phi-4" \
+  --prompt.batch-size=4 \
+  --prompt.sp-kwargs.max-tokens=200 \
+  --prompt.sp-kwargs.temperature=0.7 \
+  --benchmark.enabled=true \
+  --benchmark.bs=8 \
+  --benchmark.isl=1024
+```
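+
+As a rough illustration of how a dotted CLI key maps onto a nested structure, the sketch below expands `key.path: value` pairs into nested dictionaries. The helper `expand_dotted` is hypothetical and is not part of AutoDeploy; the real parsing is done by the argument parser described above and also handles type coercion and validation, which this sketch omits.
+
+```python
+def expand_dotted(pairs):
+    """Expand {"a.b-c": value} style keys into nested dicts (hypothetical helper)."""
+    nested = {}
+    for dotted_key, value in pairs.items():
+        # CLI flags use dashes, while the config fields use underscores.
+        parts = dotted_key.replace("-", "_").split(".")
+        node = nested
+        for part in parts[:-1]:
+            # Walk down the path, creating intermediate dicts as needed.
+            node = node.setdefault(part, {})
+        node[parts[-1]] = value
+    return nested
+
+
+# Illustrative input mirroring the CLI flags shown above.
+cli_args = {
+    "args.model-kwargs.num-hidden-layers": 10,
+    "args.world-size": 2,
+    "prompt.batch-size": 4,
+}
+config = expand_dotted(cli_args)
+print(config)
+```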
+
+## YAML Configuration Files
+
+Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, which enables you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
+
+Create a YAML configuration file (e.g., `my_config.yaml`):
+
+```yaml
+# my_config.yaml
+args:
+  model_kwargs:
+    num_hidden_layers: 12
+    hidden_size: 1024
+  world_size: 4
+  compile_backend: torch-compile
+  attn_backend: triton
+  max_seq_len: 2048
+  max_batch_size: 16
+  transforms:
+    sharding:
+      strategy: auto
+    quantization:
+      enabled: false
+
+prompt:
+  batch_size: 8
+  sp_kwargs:
+    max_tokens: 150
+    temperature: 0.8
+    top_k: 50
+
+benchmark:
+  enabled: true
+  num: 20
+  bs: 4
+  isl: 1024
+  osl: 256
+```
+
+Create an additional override file (e.g., `production.yaml`):
+
+```yaml
+# production.yaml
+args:
+  world_size: 8
+  compile_backend: torch-opt
+  max_batch_size: 32
+
+benchmark:
+  enabled: false
+```
+
+Then use these configurations:
+
+```bash
+# Using single YAML config
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml
+
+# Using multiple YAML configs (deep merged in order, later files have higher priority)
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml production.yaml
+
+# Targeting nested AutoDeployConfig with separate YAML
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml \
+  --args.yaml-configs autodeploy_overrides.yaml
+```
+
+## Configuration Precedence and Deep Merging
+
+The configuration system follows a precedence order in which higher priority sources override lower priority ones:
+
+1. **CLI Arguments** (highest priority) - Direct command line arguments
+1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
+1. **Default Settings** (lowest priority) - Built-in defaults from the config classes
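+
+The lookup order can be pictured with Python's `collections.ChainMap`, which returns the first match across a sequence of mappings. This is only a sketch of the precedence idea for flat keys; the dictionaries and their values are made up for illustration and are not the real defaults, and the actual configuration system additionally deep-merges nested fields.
+
+```python
+from collections import ChainMap
+
+# Illustrative values only; these are not AutoDeploy's real defaults.
+defaults = {"world_size": 1, "compile_backend": "torch-compile", "max_batch_size": 8}
+yaml_config = {"world_size": 4, "compile_backend": "torch-opt"}
+cli_args = {"world_size": 8}
+
+# Earlier mappings win: CLI arguments, then YAML configs, then defaults.
+effective = dict(ChainMap(cli_args, yaml_config, defaults))
+print(effective)
+```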
+
+**Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:
+
+```yaml
+# Base config
+args:
+  model_kwargs:
+    num_hidden_layers: 10
+    hidden_size: 1024
+  max_seq_len: 2048
+```
+
+```yaml
+# Override config
+args:
+  model_kwargs:
+    hidden_size: 2048  # This will override
+    # num_hidden_layers: 10 remains unchanged
+  world_size: 4  # This gets added
+```
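+
+The merge semantics above can be sketched in plain Python. The `deep_merge` helper below is hypothetical and is not the implementation AutoDeploy uses; it only shows that nested dictionaries are combined key by key instead of being replaced wholesale.
+
+```python
+def deep_merge(base, override):
+    """Recursively merge `override` into a copy of `base` (hypothetical helper)."""
+    merged = dict(base)
+    for key, value in override.items():
+        if isinstance(value, dict) and isinstance(merged.get(key), dict):
+            # Both sides are dicts: merge them instead of overwriting.
+            merged[key] = deep_merge(merged[key], value)
+        else:
+            # Scalars and new keys: the override value wins.
+            merged[key] = value
+    return merged
+
+
+base = {
+    "args": {
+        "model_kwargs": {"num_hidden_layers": 10, "hidden_size": 1024},
+        "max_seq_len": 2048,
+    }
+}
+override = {"args": {"model_kwargs": {"hidden_size": 2048}, "world_size": 4}}
+
+merged = deep_merge(base, override)
+print(merged)
+```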
+
+**Nested Config Behavior**: When using nested configurations, the outer YAML configuration files are passed as initialization settings to the inner objects, so values from the outer files take precedence over the inner objects' own YAML files:
+
+```bash
+# The outer yaml-configs affects the entire ExperimentConfig
+# The inner args.yaml-configs affects only the AutoDeployConfig
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs experiment_config.yaml \
+  --args.yaml-configs autodeploy_config.yaml \
+  --args.world-size=8  # CLI override beats both YAML configs
+```
+
+## Built-in Default Configuration
+
+Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
+
+The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
+
+```bash
+# View the default configuration
+cat tensorrt_llm/_torch/auto_deploy/config/default.yaml
+
+# Override specific transform settings
+python build_and_run_ad.py \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --args.transforms.export-to-gm.strict=true
+```
diff --git a/docs/source/torch/auto_deploy/advanced/logging.md b/docs/source/torch/auto_deploy/advanced/logging.md
@@ -0,0 +1,14 @@
+# Logging Level
+
+Use the following environment variable to specify the logging level of the built-in logger. The values are
+ordered by decreasing verbosity:
+
+```bash
+AUTO_DEPLOY_LOG_LEVEL=DEBUG
+AUTO_DEPLOY_LOG_LEVEL=INFO
+AUTO_DEPLOY_LOG_LEVEL=WARNING
+AUTO_DEPLOY_LOG_LEVEL=ERROR
+AUTO_DEPLOY_LOG_LEVEL=INTERNAL_ERROR
+```
+
+The default log level is `INFO`.
diff --git a/docs/source/torch/auto_deploy/advanced/workflow.md b/docs/source/torch/auto_deploy/advanced/workflow.md
@@ -0,0 +1,30 @@
+### Incorporating `auto_deploy` into your own workflow
+
+AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example for configuring and invoking AutoDeploy in custom applications.
+
+The following example demonstrates how to build an LLM object with AutoDeploy integration:
+
+```python
+from tensorrt_llm._torch.auto_deploy import LLM
+
+
+# Construct the LLM high-level interface object with autodeploy as backend
+llm = LLM(
+    model=<HF_MODEL_CARD_OR_DIR>,
+    world_size=<DESIRED_WORLD_SIZE>,
+    compile_backend="torch-compile",
+    model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
+    attn_backend="flashinfer", # choose between "triton" and "flashinfer"
+    attn_page_size=64, # page size for attention (tokens_per_block, should be == max_seq_len for triton)
+    skip_loading_weights=False,
+    model_factory="AutoModelForCausalLM", # choose appropriate model factory
+    mla_backend="MultiHeadLatentAttention", # for models that support MLA
+    free_mem_ratio=0.8, # fraction of available memory for cache
+    simple_shard_only=False, # tensor parallelism sharding strategy
+    max_seq_len=<MAX_SEQ_LEN>,
+    max_batch_size=<MAX_BATCH_SIZE>,
+)
+
+```
+
+For more information about configuring AutoDeploy via the `LLM` API using `**kwargs`, see the AutoDeploy LLM API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
diff --git a/docs/source/torch/auto_deploy/auto-deploy.md b/docs/source/torch/auto_deploy/auto-deploy.md
diff --git a/docs/source/torch/auto_deploy/support_matrix.md b/docs/source/torch/auto_deploy/support_matrix.md