Releases: NVIDIA/TensorRT-Model-Optimizer

ModelOpt 0.39.0 Release

13 Nov 07:25
f329b19

Deprecations

  • Deprecated modelopt.torch._deploy.utils.get_onnx_bytes API. Please use modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata instead to access the ONNX model bytes with external data. See examples/onnx_ptq/download_example_onnx.py for example usage.
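
A minimal migration sketch for the deprecation above, assuming get_onnx_bytes_and_metadata takes the same (model, dummy_input) arguments as the old API and returns a (bytes, metadata) pair; examples/onnx_ptq/download_example_onnx.py remains the authoritative usage reference.

```python
# Hedged sketch: migrating off the deprecated get_onnx_bytes API.
import torch
from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

model = torch.nn.Linear(16, 8).eval()
dummy_input = torch.randn(1, 16)

# Before (deprecated): onnx_bytes = get_onnx_bytes(model, dummy_input)
# The (onnx_bytes, metadata) return shape is an assumption here.
onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)
# `metadata` is assumed to describe the external data that accompanies the ONNX bytes.
```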

New Features

  • Added flag op_types_to_exclude_fp16 in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating 'fp32' precision in trt_plugins_precision.
  • Added LoRA mode support for MCore in a new peft submodule: modelopt.torch.peft.update_model(model, LORA_CFG). See the sketch after this list.
  • Supported PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See examples/vllm_serve for more details.
  • Added support for nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in examples/llm_ptq. Defaults to a mix of cnn_dailymail and nemotron-post-training-dataset-v2 (gated dataset accessed using the HF_TOKEN environment variable) if no dataset is specified.
  • Allows specifying calib_seq in examples/llm_ptq to set the maximum sequence length for calibration.
  • Added support for MCore MoE PTQ/QAT/QAD.
  • Added support for multi-node PTQ and export with FSDP2 in examples/llm_ptq/multinode_ptq.py. See examples/llm_ptq/README.md for more details.
  • Added support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
  • Added flags nodes_to_include and op_types_to_include in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
  • Added support for torch.compile and benchmarking in examples/diffusers/quantization/diffusion_trt.py.
  • Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See SGLang quantization documentation for more details.
  • Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
  • Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
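
A hedged sketch of the new MCore LoRA mode mentioned above. Only the modelopt.torch.peft.update_model(model, LORA_CFG) entry point is taken from the release note; the config keys and values below are illustrative assumptions, not the shipped defaults.

```python
# Hedged sketch: enabling LoRA mode on a Megatron Core model via the new peft submodule.
import modelopt.torch.peft as mtp

# An MCore GPT model constructed elsewhere by your Megatron-LM / NeMo training script.
model = ...  # placeholder

# Illustrative config; the keys below are assumptions.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"rank": 32},
}

# Attach LoRA adapters; training then typically updates only the adapter weights.
model = mtp.update_model(model, LORA_CFG)
```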

Additional Announcements

  • Starting with the next release, ModelOpt will move from odd-only minor versions to consecutive minor versions. This means the next release will be named 0.40.0 instead of 0.41.0.

ModelOpt 0.37.0 Release

08 Oct 16:43
df0882a

Deprecations

  • Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
  • Deprecated quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
  • Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks is removed and replaced with quant. engine_dir is replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
  • The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
  • Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM's TRT backend.

New Features

  • high_precision_dtype defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default. See the sketch after this list.
  • Upgraded TensorRT-LLM dependency to 1.1.0rc2.
  • Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
  • Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
  • Added Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
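
A hedged sketch of the FP16-by-default behavior above, assuming the Python entry point modelopt.onnx.quantization.quantize and the argument names shown; check the ONNX quantization documentation for the exact signature.

```python
# Hedged sketch: ONNX PTQ where quantized output model weights now default to FP16.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",            # input ONNX model (placeholder path)
    quantize_mode="int8",              # assumed mode name
    output_path="model.quant.onnx",    # assumed output argument
    high_precision_dtype="fp16",       # new default; pass "fp32" to keep FP32 weights
)
```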

ModelOpt 0.35.1 Release

20 Sep 08:32
0365238

ModelOpt 0.35.0 Release

04 Sep 05:50
c359cb7

Deprecations

  • Deprecate torch<2.6 support.
  • Deprecate NeMo 1.0 model support.

Bug Fixes

  • Fix attention head ranking logic for pruning Megatron Core GPT models.

New Features

  • ModelOpt now supports PTQ and QAT for GPT-OSS models. See examples/gpt_oss for end-to-end PTQ/QAT example.
  • Add support for QAT with HuggingFace + DeepSpeed. See examples/gpt_oss for an example.
  • Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See examples/gpt_oss for an example.
  • ModelOpt provides convenient trainers such as QATTrainer, QADTrainer, KDTrainer, and QATSFTTrainer, which inherit from the corresponding Hugging Face trainers.
    ModelOpt trainers can be used as drop-in replacements for the corresponding Hugging Face trainer (see the sketch after this list). See usage examples in examples/gpt_oss, examples/llm_qat, or examples/llm_distill.
  • (Experimental) Add quantization support for custom TensorRT op in ONNX models.
  • Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
  • Add tree decoding support for Megatron Eagle models.
  • For most VLMs, quantization is now explicitly disabled on the vision part by adding those modules to excluded_modules during HF export.
  • Add support for mamba_num_heads, mamba_head_dim, hidden_size and num_layers pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in mcore_minitron (previously mcore_gpt_minitron) mode.
  • Add example for QAT/QAD training with LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory/tree/main). See examples/llm_qat/llama_factory for more details.
  • Upgrade TensorRT-LLM dependency to 1.0.0rc6.
  • Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
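
A hedged sketch of using a ModelOpt trainer as a drop-in replacement for the Hugging Face Trainer; the import path and the quantization-config argument name are assumptions, so see examples/llm_qat or examples/gpt_oss for the canonical wiring.

```python
# Hedged sketch: QATTrainer used like a regular Hugging Face Trainer.
from transformers import AutoModelForCausalLM, TrainingArguments
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.plugins.transformers_trainer import QATTrainer  # assumed path

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # placeholder model id
train_dataset = ...  # your tokenized dataset (placeholder)

trainer = QATTrainer(
    model=model,
    args=TrainingArguments(output_dir="qat_out", num_train_epochs=1),
    train_dataset=train_dataset,
    quant_cfg=mtq.NVFP4_DEFAULT_CFG,  # assumed argument name for the quantization config
)
trainer.train()  # calibrates, then fine-tunes with fake quantization enabled
```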

ModelOpt 0.33.1 Release

12 Aug 18:50
55b9106

Bug Fixes

  • Fix a Qwen3 MOE model export issue.

ModelOpt 0.33.0 Release

14 Jul 18:18

Backward Breaking Changes

  • PyTorch dependencies for modelopt.torch features are no longer optional, and pip install nvidia-modelopt is now the same as pip install nvidia-modelopt[torch].

New Features

  • Upgrade TensorRT-LLM dependency to 0.20.
  • Add new CNN QAT example to demonstrate how to use ModelOpt for QAT.
  • Add support for ONNX models with custom TensorRT ops in Autocast.
  • Add quantization aware distillation (QAD) support in llm_qat example.
  • Add support for BF16 in ONNX quantization.
  • Add per node calibration support in ONNX quantization.
  • ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models; this requires transformers>=4.52.0. See the sketch after this list.
  • Support quantization of FSDP2 wrapped models and add FSDP2 support in the llm_qat example.
  • Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
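
A hedged sketch of the standard mtq.quantize PTQ flow that the tensor-parallel sharding support above plugs into; the model name and calibration data below are placeholders.

```python
# Hedged sketch: FP8 PTQ of a Hugging Face transformer with mtq.quantize.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
calib_texts = ["ModelOpt calibration sample."] * 8  # tiny placeholder calibration set

def forward_loop(m):
    # Run calibration batches through the model so quantizers can collect amax statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```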

ModelOpt 0.31.0 Release

05 Jun 21:02

Backward Breaking Changes

  • NeMo and Megatron-LM distributed checkpoints (torch-dist) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a torch checkpoint, and resume with 0.31 to convert it to the new format. The following changes only apply to storing and resuming distributed checkpoints.
    • quantizer_state of TensorQuantizer (modelopt.torch.quantization.nn.modules.TensorQuantizer) is now stored in the extra_state of QuantModule (modelopt.torch.quantization.nn.module.QuantModule), where it used to be stored in the sharded modelopt_state.
    • The dtype and shape of amax and pre_quant_scale stored in the distributed checkpoint are now restored. Previously, some dtypes and shapes were changed so that all decoder layers have a homogeneous structure in the checkpoint.
    • Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
  • auto_quantize API now accepts a list of quantization config dicts as the list of quantization choices (see the sketch after this list).
    • This API previously accepted a list of quantization format name strings, so it was limited to pre-defined quantization formats unless worked around with hacks.
    • With this change, users can now easily use their own custom quantization formats with auto_quantize.
    • In addition, quantization_formats now excludes None (indicating "do not quantize") as a valid format, because auto_quantize internally always adds "do not quantize" as an option anyway.
  • Model export config is refactored. The quant config in hf_quant_config.json is converted and saved to config.json. hf_quant_config.json will be deprecated soon.
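
A hedged sketch of the new config-dict interface for auto_quantize; only the quantization_formats change is taken from the notes above, while the remaining argument names, the constraint spec, and the return value are assumptions.

```python
# Hedged sketch: auto_quantize with quantization config dicts instead of format-name strings.
import modelopt.torch.quantization as mtq

model = ...         # the model to search over (placeholder)
calib_loader = ...  # iterable of calibration batches (placeholder)

my_custom_cfg = dict(mtq.FP8_DEFAULT_CFG)  # start from a pre-defined config and customize it

model, search_state = mtq.auto_quantize(   # return value assumed to be (model, state)
    model,
    constraints={"effective_bits": 6.0},          # assumed constraint spec
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, my_custom_cfg],
    data_loader=calib_loader,                     # assumed argument name
    forward_step=lambda m, batch: m(**batch),     # assumed argument name
    loss_func=lambda output, batch: output.loss,  # assumed argument name
)
```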

Deprecations

  • Deprecate Python 3.9 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.19.
  • Add new model support in the llm_ptq example: Qwen3 MoE.
  • ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant and SmoothQuant for cpu-offloaded Huggingface models.
  • Add AutoCast tool to convert ONNX models to FP16 or BF16.
  • Add --low_memory_mode flag in the llm_ptq example support to initialize HF models with compressed weights and reduce peak memory of PTQ and quantized checkpoint export.

ModelOpt 0.29.0 Release

09 May 05:26

Backward Breaking Changes

  • Refactor SequentialQuantizer to improve its implementation and maintainability while preserving its functionality.

Deprecations

  • Deprecate torch<2.4 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.18.
  • Add new model support in the llm_ptq example: Gemma-3, Llama-Nemotron.
  • Add INT8 real quantization support.
  • Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the mtq.compress (modelopt.torch.quantization.compress) API to accelerate evaluation of quantized models (see the sketch after this list).
  • Use the shape of PyTorch parameters and buffers of TensorQuantizer (modelopt.torch.quantization.nn.modules.TensorQuantizer) to initialize them during restore. This makes quantized model restoring more robust.
  • Support adding new custom quantization calibration algorithms. Please refer to mtq.calibrate (modelopt.torch.quantization.model_quant.calibrate) or the custom calibration algorithm documentation for more details.
  • Add EAGLE3 (LlamaForCausalLMEagle3) training and unified ModelOpt checkpoint export support for Megatron-LM.
  • Add support for --override_shapes flag to ONNX quantization.
    • --calibration_shapes is reserved for the input shapes used for calibration process.
    • --override_shapes is used to override the input shapes of the model with static shapes.
  • Add support for UNet ONNX quantization.
  • Enable concat_elimination pass by default to improve the performance of quantized ONNX models.
  • Enable the redundant Cast elimination pass by default in moq.quantize (modelopt.onnx.quantization.quantize).
  • Add new attribute parallel_state to DynamicModule (modelopt.torch.opt.dynamic.DynamicModule) to support distributed parallelism such as data parallelism and tensor parallelism.
  • Add MXFP8, NVFP4 quantized ONNX export support.
  • Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
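
A hedged sketch of accelerating evaluation with mtq.compress after PTQ, as referenced above; whether compress returns a new model or updates it in place is an assumption here.

```python
# Hedged sketch: fold fake-quantized weights into compressed storage after PTQ.
import modelopt.torch.quantization as mtq

model = ...  # a calibrated model, e.g. the output of mtq.quantize(...) (placeholder)

# Compress weights so evaluation runs on real-quantized GEMMs (e.g. the FP8 per-tensor kernel).
model = mtq.compress(model)  # assumed to return the compressed model
```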

ModelOpt 0.27.1 Release

15 Apr 18:24

Add experimental quantization support for Llama4, QwQ and Qwen MOE models.

ModelOpt 0.27.0 Release

03 Apr 05:24

Deprecations

  • Deprecate real quantization configs; please use the mtq.compress (modelopt.torch.quantization.compress) API for model compression after quantization.

New Features

  • New model support in the llm_ptq example: OpenAI Whisper.
  • Blockwise FP8 quantization support in unified model export.
  • Add quantization support to the Transformer Engine Linear module.
  • Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
  • To support resuming distributed checkpoints with expert parallelism (EP), modelopt_state in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy modelopt_state in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
  • Add triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
  • Add a new API mtq.compress (modelopt.torch.quantization.compress) for compressing model weights after quantization.
  • Add option to simplify ONNX model before quantization is performed.
  • (Experimental) Improve support for ONNX models with custom TensorRT op:
    • Add support for --calibration_shapes flag.
    • Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.

Known Issues

  • Quantization of T5 models is broken. Please use nvidia-modelopt==0.25.0 with transformers<4.50 in the meantime.