Commit cc0f4c8

[None][doc] Move AutoDeploy README.md to torch docs (#6528)
Signed-off-by: Frida Hou <[email protected]> Signed-off-by: Suyog Gupta <[email protected]> Co-authored-by: Suyog Gupta <[email protected]>
1 parent efcb8f7 commit cc0f4c8

9 files changed

+575
-0
lines changed

docs/source/media/ad_overview.png

209 KB

docs/source/torch.md

Lines changed: 4 additions & 0 deletions
@@ -38,3 +38,7 @@ Here is a simple example to show how to use `tensorrt_llm.LLM` API with Llama mo
## Known Issues

- The PyTorch backend on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms.

## Prototype Features

- [AutoDeploy: Seamless Model Deployment from PyTorch to TensorRT-LLM](./torch/auto_deploy/auto-deploy.md)
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Benchmarking with trtllm-bench

AutoDeploy is integrated with the `trtllm-bench` performance benchmarking utility, enabling you to measure comprehensive performance metrics such as token throughput, request throughput, and latency for your AutoDeploy-optimized models.

## Getting Started

Before benchmarking with AutoDeploy, review the [TensorRT-LLM benchmarking guide](../../performance/perf-benchmarking.md#running-with-the-pytorch-workflow) to familiarize yourself with the standard trtllm-bench workflow and best practices.

## Basic Usage

Invoke the AutoDeploy backend by specifying `--backend _autodeploy` in your `trtllm-bench` command:

```bash
trtllm-bench \
  --model meta-llama/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend _autodeploy
```

```{note}
As in the PyTorch workflow, AutoDeploy does not require a separate `trtllm-bench build` step. The model is automatically optimized during benchmark initialization.
```

## Advanced Configuration

For more granular control over AutoDeploy's behavior during benchmarking, use the `--extra_llm_api_options` flag with a YAML configuration file:

```bash
trtllm-bench \
  --model meta-llama/Llama-3.1-8B \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```

### Configuration Examples

#### Basic Performance Configuration (`autodeploy_config.yaml`)

```yaml
# Compilation backend
compile_backend: torch-opt

# Runtime engine
runtime: trtllm

# Model loading
skip_loading_weights: false

# Fraction of free memory to use for kv-caches
free_mem_ratio: 0.8

# CUDA Graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Attention backend
attn_backend: flashinfer

# Sequence configuration
max_batch_size: 256
```

Enable multi-GPU execution by specifying `--tp n`, where `n` is the number of GPUs.
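
When benchmarks are launched from a script, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this guide. The helper name `build_bench_command` is hypothetical, and placing `--tp` after the `throughput` subcommand is an assumption that should be checked against `trtllm-bench throughput --help`.

```python
import shlex
from typing import List, Optional


def build_bench_command(
    model: str,
    dataset: str,
    extra_options: Optional[str] = None,
    tp: Optional[int] = None,
) -> List[str]:
    """Assemble a trtllm-bench throughput command for the AutoDeploy backend."""
    cmd = [
        "trtllm-bench",
        "--model", model,
        "throughput",
        "--dataset", dataset,
        "--backend", "_autodeploy",
    ]
    if extra_options is not None:
        # YAML file with AutoDeploy settings such as compile_backend.
        cmd += ["--extra_llm_api_options", extra_options]
    if tp is not None:
        # Assumption: --tp is accepted by the throughput subcommand.
        cmd += ["--tp", str(tp)]
    return cmd


cmd = build_bench_command(
    model="meta-llama/Llama-3.1-8B",
    dataset="/tmp/synthetic_128_128.txt",
    extra_options="autodeploy_config.yaml",
    tp=2,
)
# Print a shell-safe string; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.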

## Configuration Options Reference

### Core Performance Settings

| Parameter | Default | Description |
|-----------|---------|-------------|
| `compile_backend` | `torch-compile` | Compilation backend: `torch-simple`, `torch-compile`, `torch-cudagraph`, `torch-opt` |
| `runtime` | `trtllm` | Runtime engine: `trtllm`, `demollm` |
| `free_mem_ratio` | `0.0` | Fraction of available GPU memory for KV cache (0.0-1.0) |
| `skip_loading_weights` | `false` | Skip weight loading for architecture-only benchmarks |

### CUDA Graph Optimization

| Parameter | Default | Description |
|-----------|---------|-------------|
| `cuda_graph_batch_sizes` | `null` | List of batch sizes for CUDA graph creation |

```{tip}
For optimal CUDA graph performance, specify batch sizes that match your expected workload patterns. For example: `[1, 2, 4, 8, 16, 32, 64, 128]`
```
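
The example list in the tip is simply the powers of two up to the maximum batch size. As a small sketch (the helper name `power_of_two_batch_sizes` is hypothetical and not part of AutoDeploy), such a list can be generated for any `max_batch_size` and pasted into the YAML file:

```python
from typing import List


def power_of_two_batch_sizes(max_batch_size: int) -> List[int]:
    """Return the powers of two that do not exceed max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    sizes = []
    size = 1
    while size <= max_batch_size:
        sizes.append(size)
        size *= 2
    return sizes


print(power_of_two_batch_sizes(128))  # [1, 2, 4, 8, 16, 32, 64, 128]
```

Powers of two are only a starting point; if production traffic clusters around other batch sizes, listing those sizes directly may serve better.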

## Performance Optimization Tips

1. **Memory Management**: Set `free_mem_ratio` to 0.8-0.9 for optimal KV cache utilization.
1. **Compilation Backend**: Use `torch-opt` for production workloads.
1. **Attention Backend**: `flashinfer` generally provides the best performance for most models.
1. **CUDA Graphs**: Enable CUDA graphs for batch sizes that match your production traffic patterns.
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
# Example Run Script

To build and run the AutoDeploy example, use the `examples/auto_deploy/build_and_run_ad.py` script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

You can configure your experiment with various options. Use the `-h/--help` flag to see available options:

```bash
python build_and_run_ad.py --help
```

The following is a non-exhaustive list of common configuration options:

| Configuration Key | Description |
|-------------------|-------------|
| `--model` | The HF model card or path to a HF checkpoint folder |
| `--args.model-factory` | Choose model factory implementation (`"AutoModelForCausalLM"`, ...) |
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs that are being passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs that are being passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | Specifies how to compile the graph at the end |
| `--args.attn-backend` | Specifies kernel implementation for attention |
| `--args.mla-backend` | Specifies implementation for multi-head latent attention |
| `--args.max-seq-len` | Maximum sequence length for inference/cache |
| `--args.max-batch-size` | Maximum dimension for statically allocated KV cache |
| `--args.attn-page-size` | Page size for attention |
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the `ExperimentConfig` class in the `examples/auto_deploy/build_and_run_ad.py` file.

The following is a more complete example of using the script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.world-size 2 \
  --args.runtime "demollm" \
  --args.compile-backend "torch-compile" \
  --args.attn-backend "flashinfer" \
  --benchmark.enabled True
```
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
# Expert Configuration of LLM API

For advanced TensorRT-LLM users, the full set of `tensorrt_llm._torch.auto_deploy.llm_args.LlmArgs` is exposed. Use at your own risk. The argument list may diverge from the standard TRT-LLM argument list.

- All configuration fields used by the AutoDeploy core pipeline, `InferenceOptimizer`, are exposed exclusively in `AutoDeployConfig` in `tensorrt_llm._torch.auto_deploy.llm_args`.
  Please make sure to refer to those first.
- For advanced users, the full set of `LlmArgs` in `tensorrt_llm._torch.auto_deploy.llm_args` can be used to configure the AutoDeploy `LLM` API, including runtime options.
- Note that some fields in the full `LlmArgs` object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments that configure the model itself, since AutoDeploy's model ingestion and optimization pipeline differs significantly from the default manual workflow in TensorRT-LLM.
- However, with proper care, the full `LlmArgs` object can be used to configure advanced runtime options in TensorRT-LLM.
- Any valid field can simply be provided as a keyword argument ("`**kwargs`") to the AutoDeploy `LLM` API.

# Expert Configuration of `build_and_run_ad.py`

For advanced users, `build_and_run_ad.py` provides advanced configuration capabilities using a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and rely on the configuration precedence rules to build complex deployment configurations.

## CLI Arguments with Dot Notation

The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the `ExperimentConfig` in `examples/auto_deploy/build_and_run_ad.py` and nested `AutoDeployConfig` or `LlmArgs` objects in `tensorrt_llm._torch.auto_deploy.llm_args`:

```bash
# Configure model parameters
# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
# specified as CLI arg
python build_and_run_ad.py \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --args.model-kwargs.num-hidden-layers=10 \
  --args.model-kwargs.hidden-size=2048 \
  --args.tokenizer-kwargs.padding-side=left

# Configure runtime and backend options
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.world-size=2 \
  --args.compile-backend=torch-opt \
  --args.attn-backend=flashinfer

# Configure prompting and benchmarking
python build_and_run_ad.py \
  --model "microsoft/phi-4" \
  --prompt.batch-size=4 \
  --prompt.sp-kwargs.max-tokens=200 \
  --prompt.sp-kwargs.temperature=0.7 \
  --benchmark.enabled=true \
  --benchmark.bs=8 \
  --benchmark.isl=1024
```
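
To make the mapping described in the NOTE above concrete, the following sketch turns dot-notation arguments into a nested dictionary. It is an illustration only, not the actual parser used by `build_and_run_ad.py`: the function name `parse_dot_args` is hypothetical, only the `--key=value` form is handled, and values are left as strings rather than being type-coerced.

```python
from typing import Any, Dict, List


def parse_dot_args(argv: List[str]) -> Dict[str, Any]:
    """Turn ``--a.b-c=value`` style arguments into a nested dict.

    Illustrative only: values stay strings and only ``--key=value`` is handled.
    """
    config: Dict[str, Any] = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        dotted_key, value = arg[2:].split("=", 1)
        # CLI keys use hyphens; the config fields use underscores.
        parts = [part.replace("-", "_") for part in dotted_key.split(".")]
        node = config
        for part in parts[:-1]:
            # Walk down, creating intermediate dicts as needed.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config


nested = parse_dot_args([
    "--args.model-kwargs.num-hidden-layers=10",
    "--args.model-kwargs.hidden-size=2048",
    "--args.tokenizer-kwargs.padding-side=left",
])
print(nested)
```

Each dot introduces one level of nesting, so the first argument ends up under `nested["args"]["model_kwargs"]["num_hidden_layers"]`.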

## YAML Configuration Files

Both `ExperimentConfig` and `AutoDeployConfig`/`LlmArgs` inherit from `DynamicYamlMixInForSettings`, which enables you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):

```yaml
# my_config.yaml
args:
  model_kwargs:
    num_hidden_layers: 12
    hidden_size: 1024
  world_size: 4
  compile_backend: torch-compile
  attn_backend: triton
  max_seq_len: 2048
  max_batch_size: 16
  transforms:
    sharding:
      strategy: auto
    quantization:
      enabled: false

prompt:
  batch_size: 8
  sp_kwargs:
    max_tokens: 150
    temperature: 0.8
    top_k: 50

benchmark:
  enabled: true
  num: 20
  bs: 4
  isl: 1024
  osl: 256
```

Create an additional override file (e.g., `production.yaml`):

```yaml
# production.yaml
args:
  world_size: 8
  compile_backend: torch-opt
  max_batch_size: 32

benchmark:
  enabled: false
```

Then use these configurations:

```bash
# Using single YAML config
python build_and_run_ad.py \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --yaml-configs my_config.yaml

# Using multiple YAML configs (deep merged in order, later files have higher priority)
python build_and_run_ad.py \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --yaml-configs my_config.yaml production.yaml

# Targeting nested AutoDeployConfig with separate YAML
python build_and_run_ad.py \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --yaml-configs my_config.yaml \
  --args.yaml-configs autodeploy_overrides.yaml
```

## Configuration Precedence and Deep Merging

The configuration system follows a precedence order in which higher priority sources override lower priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes
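
For flat (non-nested) keys, this lookup order can be modeled with Python's `collections.ChainMap`, which searches its mappings from left to right. This is only a mental model, not how the script resolves settings, and the values below are made up for illustration rather than taken from the real defaults.

```python
from collections import ChainMap

# Hypothetical values, lowest to highest priority.
defaults = {"world_size": 1, "compile_backend": "torch-compile", "max_batch_size": 8}
yaml_config = {"world_size": 4, "compile_backend": "torch-opt"}
cli_args = {"world_size": 8}

# The first mapping wins, so list sources from highest to lowest priority.
resolved = ChainMap(cli_args, yaml_config, defaults)

print(resolved["world_size"])       # 8, from the CLI
print(resolved["compile_backend"])  # torch-opt, from YAML
print(resolved["max_batch_size"])   # 8, from the defaults
```

`ChainMap` does not look inside nested dictionaries, which is why nested sections such as `args.model_kwargs` need deep merging instead.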

**Deep Merging**: Unlike simple overwriting, deep merging recursively combines nested dictionaries. For example:

```yaml
# Base config
args:
  model_kwargs:
    num_hidden_layers: 10
    hidden_size: 1024
  max_seq_len: 2048
```

```yaml
# Override config
args:
  model_kwargs:
    hidden_size: 2048  # This will override
    # num_hidden_layers: 10 remains unchanged
  world_size: 4  # This gets added
```
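
The effect of merging these two files can be reproduced with a few lines of plain Python. The `deep_merge` function below is a hypothetical stand-in written for this illustration; the script itself relies on OmegaConf, whose merge semantics may differ in details such as list handling.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge ``override`` into a copy of ``base``."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the override wins.
            merged[key] = deepcopy(value)
    return merged


base = {
    "args": {
        "model_kwargs": {"num_hidden_layers": 10, "hidden_size": 1024},
        "max_seq_len": 2048,
    }
}
override = {"args": {"model_kwargs": {"hidden_size": 2048}, "world_size": 4}}

merged = deep_merge(base, override)
print(merged)
```

In the result, `hidden_size` is overridden to 2048, `num_hidden_layers` and `max_seq_len` are kept from the base config, and `world_size` is added.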

**Nested Config Behavior**: When using nested configurations, the outer YAML configuration files are passed to the inner objects as initialization settings, which gives them higher precedence than the inner objects' own YAML files:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
# The inner args.yaml-configs affects only the AutoDeployConfig
python build_and_run_ad.py \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --yaml-configs experiment_config.yaml \
  --args.yaml-configs autodeploy_config.yaml \
  --args.world-size=8  # CLI override beats both YAML configs
```

## Built-in Default Configuration

Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:

```bash
# View the default configuration
cat tensorrt_llm/_torch/auto_deploy/config/default.yaml

# Override specific transform settings
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.transforms.export-to-gm.strict=true
```
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Logging Level

Use the following environment variable to specify the logging level of the built-in logger. The levels are listed in order of decreasing verbosity:

```bash
AUTO_DEPLOY_LOG_LEVEL=DEBUG
AUTO_DEPLOY_LOG_LEVEL=INFO
AUTO_DEPLOY_LOG_LEVEL=WARNING
AUTO_DEPLOY_LOG_LEVEL=ERROR
AUTO_DEPLOY_LOG_LEVEL=INTERNAL_ERROR
```

The default log level is `INFO`.
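
When AutoDeploy is driven from Python rather than from a shell, the variable can be set through `os.environ`. The sketch below validates the level against the list above before exporting it. The helper name `set_auto_deploy_log_level` is hypothetical, and it is an assumption that the variable is read when the AutoDeploy logger is first created, so the call is best made before importing `tensorrt_llm`.

```python
import os

# Levels listed in this guide, from most to least verbose.
LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "INTERNAL_ERROR")


def set_auto_deploy_log_level(level: str) -> str:
    """Validate ``level`` and export it as AUTO_DEPLOY_LOG_LEVEL."""
    normalized = level.upper()
    if normalized not in LOG_LEVELS:
        raise ValueError(f"unknown log level {level!r}; expected one of {LOG_LEVELS}")
    os.environ["AUTO_DEPLOY_LOG_LEVEL"] = normalized
    return normalized


# Assumption: call this before AutoDeploy is imported so the logger sees it.
set_auto_deploy_log_level("debug")
```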
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
### Incorporating `auto_deploy` into your own workflow

AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example for configuring and invoking AutoDeploy in custom applications.

The following example demonstrates how to build an LLM object with AutoDeploy integration:

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    world_size=<DESIRED_WORLD_SIZE>,
    compile_backend="torch-compile",
    model_kwargs={"num_hidden_layers": 2},  # test with smaller model configuration
    attn_backend="flashinfer",  # choose between "triton" and "flashinfer"
    attn_page_size=64,  # page size for attention (tokens_per_block, should be == max_seq_len for triton)
    skip_loading_weights=False,
    model_factory="AutoModelForCausalLM",  # choose appropriate model factory
    mla_backend="MultiHeadLatentAttention",  # for models that support MLA
    free_mem_ratio=0.8,  # fraction of available memory for cache
    simple_shard_only=False,  # tensor parallelism sharding strategy
    max_seq_len=<MAX_SEQ_LEN>,
    max_batch_size=<MAX_BATCH_SIZE>,
)
```

For more information about configuring AutoDeploy via the `LLM` API using `**kwargs`, see the AutoDeploy LLM API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
