
Commit 0d54263

committed
moving AutoDeploy README to doc
Signed-off-by: Frida Hou <[email protected]>
1 parent baece56 commit 0d54263

9 files changed: +443 -0 lines changed


docs/source/auto-deploy.md

Lines changed: 71 additions & 0 deletions

# AutoDeploy

```{note}
This project is in active development and is currently in an early (beta) stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability.
```

#### Seamless Model Deployment from PyTorch to TRT-LLM

AutoDeploy is an experimental beta feature designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching, and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.

## Motivation & Approach

Deploying large language models (LLMs) can be challenging, especially when balancing ease of use with high performance. Teams need simple, intuitive deployment solutions that reduce engineering effort, speed up the integration of new models, and support rapid experimentation without compromising performance.

AutoDeploy addresses these challenges with a streamlined, (semi-)automated pipeline that transforms in-framework PyTorch models, including Hugging Face models, into optimized inference-ready models for TRT-LLM. It simplifies deployment, optimizes models for efficient inference, and bridges the gap between simplicity and performance.

### Key Features

- **Seamless Model Transition:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
- **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
- **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
- **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
- **Quick Setup & Prototyping:** Lightweight pip package for easy installation with a demo environment for fast testing.

## Get Started

1. **Install AutoDeploy:**

   AutoDeploy is available as part of the TRT-LLM installation:

   ```bash
   sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
   ```

   Refer to the [TRT-LLM installation guide](./installation/linux.md) for more information.

2. **Run Llama Example:**

   You are now ready to run an in-framework Llama demo.

   The general entrypoint for running the AutoDeploy demo is the `build_and_run_ad.py` script. Checkpoints are loaded directly from Hugging Face (HF) or a local HF-like directory:

   ```bash
   cd examples/auto_deploy
   python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
   ```

## Support Matrix

AutoDeploy streamlines the model deployment process through an automated workflow designed for efficiency and performance. The workflow begins with a PyTorch model, which is exported using `torch.export` to generate a standard Torch graph. This graph contains core PyTorch ATen operations alongside custom attention operations, determined by the attention backend specified in the configuration.
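
For illustration, here is a minimal, self-contained sketch of what this export step produces. This is not AutoDeploy code (AutoDeploy performs the export and all subsequent transformations internally), and the toy module below is purely illustrative:

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Toy stand-in for a real model, used only to illustrate torch.export."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 64)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


# torch.export captures an ATen-level graph of the forward pass; AutoDeploy then
# applies its graph transformations (sharding, KV-cache insertion, fusion, ...)
# on top of such a graph.
exported = torch.export.export(TinyMLP().eval(), (torch.randn(2, 64),))
print(exported.graph_module.graph)  # graph of core ATen operations
```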

The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (such as `torch-opt`) and then deployed via the TRT-LLM runtime.

- [Support Matrix](./auto_deploy/support_matrix.md)

## Advanced Usage

- [Example Run Script](./auto_deploy/advanced/example_run.md)
- [Logging Level](./auto_deploy/advanced/logging.md)
- [Model Evaluation with LM Evaluation Harness](./auto_deploy/advanced/model_eval.md)
- [Mixed-precision Quantization using TensorRT Model Optimizer](./auto_deploy/advanced/mixed_precision_quantization.md)
- [Incorporating auto_deploy into your own workflow](./auto_deploy/advanced/workflow.md)
- [Expert Configurations](./auto_deploy/advanced/expert_configurations.md)

## Roadmap

Check out our [GitHub Project Board](https://github.com/orgs/NVIDIA/projects/83) to learn more about the current progress in AutoDeploy and where you can help.

Lines changed: 49 additions & 0 deletions

# Example Run Script ([`build_and_run_ad.py`](./build_and_run_ad.py))

To build and run the AutoDeploy example, use the [`build_and_run_ad.py`](./build_and_run_ad.py) script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

You can arbitrarily configure your experiment. Use the `-h/--help` flag to see available options:

```bash
python build_and_run_ad.py --help
```

Below is a non-exhaustive list of common config options:

| Configuration Key | Description |
|-------------------|-------------|
| `--model` | The HF model card or path to a HF checkpoint folder |
| `--args.model-factory` | Choose model factory implementation (`"AutoModelForCausalLM"`, ...) |
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs that are passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs that are passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | Specifies which type of engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | Specifies how to compile the graph at the end |
| `--args.attn-backend` | Specifies the kernel implementation for attention |
| `--args.mla-backend` | Specifies the implementation for multi-head latent attention |
| `--args.max-seq-len` | Maximum sequence length for inference/cache |
| `--args.max-batch-size` | Maximum dimension for statically allocated KV cache |
| `--args.attn-page-size` | Page size for attention |
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the [`ExperimentConfig`](./build_and_run_ad.py) class in the [`build_and_run_ad.py`](./build_and_run_ad.py) file.

Here is a more complete example of using the script:

```bash
cd examples/auto_deploy
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size 2 \
--args.runtime "demollm" \
--args.compile-backend "torch-compile" \
--args.attn-backend "flashinfer" \
--benchmark.enabled True
```

Lines changed: 182 additions & 0 deletions

# Expert Configuration of LLM API

For expert TensorRT-LLM users, we also expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) *at your own risk* (the argument list diverges from TRT-LLM's argument list):

- All config fields that are used by the AutoDeploy core pipeline (i.e., the `InferenceOptimizer`) are _exclusively_ exposed in the [`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py). Please refer to those first.
- For expert users, we expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) that can be used to configure the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py), including runtime options.
- Note that some fields in the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments pertaining to configuring the model itself, since AutoDeploy's model ingestion and optimization pipeline differs significantly from the default manual workflow in TensorRT-LLM.
- However, with proper care the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) object can be used to configure advanced runtime options in TensorRT-LLM.
- Note that any valid field can simply be provided as a keyword argument (`**kwargs`) to the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py), as in the sketch below.
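
As a minimal sketch of this keyword-argument usage (the model name and values below are illustrative; see the workflow example in the advanced documentation for a fuller version):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Any valid AutoDeployConfig/LlmArgs field can be passed directly as a keyword
# argument to the AutoDeploy LLM API.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    world_size=2,
    compile_backend="torch-compile",
    attn_backend="flashinfer",
)
```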

# Expert Configuration of `build_and_run_ad.py`

For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.

## CLI Arguments with Dot Notation

The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the [`ExperimentConfig`](./build_and_run_ad.py) and the nested [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) objects:

```bash
# Configure model parameters
# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
# specified as CLI arg
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--args.model-kwargs.num-hidden-layers=10 \
--args.model-kwargs.hidden-size=2048 \
--args.tokenizer-kwargs.padding-side=left

# Configure runtime and backend settings
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size=2 \
--args.compile-backend=torch-opt \
--args.attn-backend=flashinfer

# Configure prompting and benchmarking
python build_and_run_ad.py \
--model "microsoft/phi-4" \
--prompt.batch-size=4 \
--prompt.sp-kwargs.max-tokens=200 \
--prompt.sp-kwargs.temperature=0.7 \
--benchmark.enabled=true \
--benchmark.bs=8 \
--benchmark.isl=1024
```

## YAML Configuration Files

Both [`ExperimentConfig`](./build_and_run_ad.py) and [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) inherit from [`DynamicYamlMixInForSettings`](../../tensorrt_llm/_torch/auto_deploy/utils/_config.py), enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):

```yaml
# my_config.yaml
args:
  model_kwargs:
    num_hidden_layers: 12
    hidden_size: 1024
  world_size: 4
  compile_backend: torch-compile
  attn_backend: triton
  max_seq_len: 2048
  max_batch_size: 16
  transforms:
    sharding:
      strategy: auto
    quantization:
      enabled: false

prompt:
  batch_size: 8
  sp_kwargs:
    max_tokens: 150
    temperature: 0.8
    top_k: 50

benchmark:
  enabled: true
  num: 20
  bs: 4
  isl: 1024
  osl: 256
```

Create an additional override file (e.g., `production.yaml`):

```yaml
# production.yaml
args:
  world_size: 8
  compile_backend: torch-opt
  max_batch_size: 32

benchmark:
  enabled: false
```

Then use these configurations:

```bash
# Using single YAML config
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml

# Using multiple YAML configs (deep merged in order, later files have higher priority)
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml production.yaml

# Targeting nested AutoDeployConfig with separate YAML
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml \
--args.yaml-configs autodeploy_overrides.yaml
```

## Configuration Precedence and Deep Merging

The configuration system follows a strict precedence order in which higher-priority sources override lower-priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes

**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:

```yaml
# Base config
args:
  model_kwargs:
    num_hidden_layers: 10
    hidden_size: 1024
  max_seq_len: 2048
```

```yaml
# Override config
args:
  model_kwargs:
    hidden_size: 2048  # This will override
    # num_hidden_layers: 10 remains unchanged
  world_size: 4  # This gets added
```
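
The resulting effective configuration after deep merging these two files is:

```yaml
# Effective merged config
args:
  model_kwargs:
    num_hidden_layers: 10  # kept from the base config
    hidden_size: 2048      # overridden by the override config
  max_seq_len: 2048        # kept from the base config
  world_size: 4            # added by the override config
```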

**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
# The inner args.yaml-configs affects only the AutoDeployConfig
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs experiment_config.yaml \
--args.yaml-configs autodeploy_config.yaml \
--args.world-size=8  # CLI override beats both YAML configs
```

## Built-in Default Configuration

Both the [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) and [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) classes automatically load a built-in [`default.yaml`](../../tensorrt_llm/_torch/auto_deploy/config/default.yaml) configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the [`_get_config_dict()`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) function and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:

```bash
# View the default configuration
cat tensorrt_llm/_torch/auto_deploy/config/default.yaml

# Override specific transform settings
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.transforms.export-to-gm.strict=true
```

Lines changed: 14 additions & 0 deletions

# Logging Level

Use the following environment variable to specify the logging level of the built-in logger, ordered by decreasing verbosity:

```bash
AUTO_DEPLOY_LOG_LEVEL=DEBUG
AUTO_DEPLOY_LOG_LEVEL=INFO
AUTO_DEPLOY_LOG_LEVEL=WARNING
AUTO_DEPLOY_LOG_LEVEL=ERROR
AUTO_DEPLOY_LOG_LEVEL=INTERNAL_ERROR
```

The default level is `INFO`.
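
For example, to run the Llama demo from the example run script with verbose logging (same script and model as in the example run documentation, with the environment variable set inline):

```bash
cd examples/auto_deploy
AUTO_DEPLOY_LOG_LEVEL=DEBUG python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
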
Lines changed: 17 additions & 0 deletions

# Mixed-precision Quantization using TensorRT Model Optimizer

The TensorRT Model Optimizer [AutoQuantize](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a post-training quantization (PTQ) algorithm from ModelOpt that quantizes a model by searching for the best quantization format per layer while meeting a user-specified performance constraint. This allows `AutoQuantize` to trade off model accuracy for performance.

Currently, `AutoQuantize` supports only `effective_bits` as the performance constraint (for both weight-only quantization and weight & activation quantization). See the [AutoQuantize documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.

## 1. Quantize a model with ModelOpt

Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating a quantized model checkpoint.

## 2. Deploy the quantized model with AutoDeploy

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model "<MODELOPT_CKPT_PATH>" --args.world-size 1
```

Lines changed: 11 additions & 0 deletions

# Model Evaluation with LM Evaluation Harness

lm-evaluation-harness is supported. To run the evaluation, use the following command:

```bash
# model is defined the same way as above. Other config args can also be specified in model_args (comma-separated).
# You can specify any task supported by lm-evaluation-harness.
cd examples/auto_deploy
python lm_eval_ad.py \
--model autodeploy --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,world_size=2 --tasks mmlu
```

Lines changed: 32 additions & 0 deletions

# Incorporating `auto_deploy` into your own workflow

AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM high-level API. This section provides a blueprint for configuring and invoking AutoDeploy within your custom applications.

Here is an example of how you can build an LLM object with AutoDeploy integration:

```python
from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with AutoDeploy as the backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    world_size=<DESIRED_WORLD_SIZE>,
    compile_backend="torch-compile",
    model_kwargs={"num_hidden_layers": 2},  # test with a smaller model configuration
    attn_backend="flashinfer",  # choose between "triton" and "flashinfer"
    attn_page_size=64,  # page size for attention (tokens_per_block; should be == max_seq_len for triton)
    skip_loading_weights=False,
    model_factory="AutoModelForCausalLM",  # choose the appropriate model factory
    mla_backend="MultiHeadLatentAttention",  # for models that support MLA
    free_mem_ratio=0.8,  # fraction of available memory for the cache
    simple_shard_only=False,  # tensor parallelism sharding strategy
    max_seq_len=<MAX_SEQ_LEN>,
    max_batch_size=<MAX_BATCH_SIZE>,
)
```
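
Once constructed, the `llm` object behaves like a standard TRT-LLM `LLM` instance. As a rough sketch of how it might be used for generation (assuming the standard `LLM.generate`/`SamplingParams` interface; adapt as needed for your TRT-LLM version):

```python
from tensorrt_llm import SamplingParams

# Sketch only: generate a completion with the LLM object constructed above.
sampling_params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["What is AutoDeploy?"], sampling_params)
print(outputs[0].outputs[0].text)

llm.shutdown()  # release engine resources when done
```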

Please consult the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py) and the [`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.
