Conversation


@Edwardf0t1 Edwardf0t1 commented Sep 8, 2025

This is the third PR in a three-part series to enable native ModelOpt quantization in SGLang. It includes changes from the first PR (#7149) and second PR (#9991) and will be rebased once the first two PRs are merged.

Motivation

We aim to enhance SGLang's quantization capabilities, making ModelOpt integration more robust and user-friendly while adding quantized-checkpoint persistence, so a model can be quantized and exported once and then served repeatedly in production without re-quantizing.

Modifications

  • Integrated ModelOpt quantized-model export functionality.
  • Added a modelopt_export_path parameter to _setup_modelopt_quantization() in ModelOptModelLoader.
  • Implemented the _export_modelopt_checkpoint() method using ModelOpt's unified HF export API (see the sketch after this list).
  • Added a modelopt_export_path field to ModelConfig and a --modelopt-export-path command-line argument to ServerArgs.
  • Export happens automatically after quantization (or when restoring from a checkpoint).
  • Added unit tests for the export functionality.
  • Unified the quantization flags across the quantize + export and deployment phases.
  • Added an example script that runs ModelOpt quantize + export + deployment.
  • TODO: Enable a quantize-and-serve mode that performs quantize + export + deployment with a single command.
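
For reference, a minimal sketch of what the export step could look like. It assumes ModelOpt's unified HF export entry point is modelopt.torch.export.export_hf_checkpoint with an export_dir argument; the helper name mirrors the one added in this PR, but the body is illustrative only:

import os

from modelopt.torch.export import export_hf_checkpoint  # assumed ModelOpt unified HF export API


def _export_modelopt_checkpoint(model, export_path, original_model_path=None):
    """Export a ModelOpt-quantized model to a HuggingFace-format checkpoint (sketch)."""
    os.makedirs(export_path, exist_ok=True)
    # Writes the quantized weights plus hf_quant_config.json, so the directory
    # can later be served directly with --quantization modelopt.
    export_hf_checkpoint(model, export_dir=export_path)
    # original_model_path could be used to copy tokenizer/config files alongside
    # the weights; omitted here for brevity.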

Accuracy Tests

Production Workflow:

# Step 1: Quantize + Export
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# Step 2: Deploy
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt
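
The exported checkpoint can also be served from Python. A minimal sketch using SGLang's offline Engine API, assuming the export directory from Step 1 and the same quantization="modelopt" setting:

import sglang as sgl

# Load the exported FP8 checkpoint produced in Step 1.
llm = sgl.Engine(
    model_path="./quantized_tinyllama_fp8",
    quantization="modelopt",
)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 16})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()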

Benchmarking and Profiling

Checklist

Summary by CodeRabbit

  • New Features

    • Added NVIDIA ModelOpt quantization support (FP8/FP4 auto-detection), export to Hugging Face format, and serving of exported models.
    • Introduced CLI options to export after quantization and to quantize-and-serve.
    • Added quantization choice: modelopt_fp8.
    • Included an example script demonstrating quantize, export, and deploy.
  • Documentation

    • New guide “Using NVIDIA ModelOpt” covering installation, workflow, Python usage, deployment, and advanced features; reference updated.
  • Tests

    • Expanded coverage for ModelOpt workflows and additional model/attention components.
  • Chores

    • Added optional dependency group for ModelOpt.

@Edwardf0t1
Collaborator Author

@zhyncs @Qiaolin-Yu Please help review this PR, or find someone who can, when you get a chance. Thank you!

@Qiaolin-Yu Qiaolin-Yu self-assigned this Sep 13, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 19fcedb to 95fc54b Compare September 13, 2025 01:48
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 95fc54b to d25e5d1 Compare September 23, 2025 08:18

[project.optional-dependencies]
decord = ["decord"]
modelopt = ["nvidia-modelopt"]
Collaborator Author

@Edwardf0t1 Edwardf0t1 Sep 26, 2025


@zhyncs @Qiaolin-Yu Please let us know whether it's okay to add modelopt as an optional dependency, or whether it should be a required dependency.

cc @Ying1123 @merrymercy

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch 2 times, most recently from c5181b3 to 15dd13e Compare September 30, 2025 05:34
@b8zhong b8zhong added the run-ci label Oct 6, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 15dd13e to 9c2eaac Compare October 8, 2025 08:06
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 7b27705 to 456a3f9 Compare October 14, 2025 08:24
@JustinTong0323 JustinTong0323 self-assigned this Oct 14, 2025
Comment on lines +114 to +130
@classmethod
def override_quantization_method(cls, hf_quant_config, user_quant):
    """Override quantization method based on the model's config."""
    if hf_quant_config is None:
        return None

    # Check if this is a ModelOpt config
    quant_algo = hf_quant_config.get("quant_algo", "").upper()

    # If user specified generic "modelopt", auto-detect the specific method
    if user_quant == "modelopt":
        if "FP8" in quant_algo:
            return "modelopt_fp8"
        elif "NVFP4" in quant_algo or "FP4" in quant_algo:
            return "modelopt_fp4"

    return None
Collaborator


Why do we have the exact same code duplicated here?

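
For context, a standalone illustration of how the auto-detection above behaves. The config dicts are hypothetical but match the shape the method reads (a top-level quant_algo key); the function is a copy of the quoted logic, not the class method itself:

def _detect_modelopt_method(hf_quant_config, user_quant):
    """Standalone copy of the auto-detection logic above, for illustration only."""
    if hf_quant_config is None:
        return None
    quant_algo = hf_quant_config.get("quant_algo", "").upper()
    if user_quant == "modelopt":
        if "FP8" in quant_algo:
            return "modelopt_fp8"
        elif "NVFP4" in quant_algo or "FP4" in quant_algo:
            return "modelopt_fp4"
    return None


assert _detect_modelopt_method({"quant_algo": "FP8"}, "modelopt") == "modelopt_fp8"
assert _detect_modelopt_method({"quant_algo": "NVFP4"}, "modelopt") == "modelopt_fp4"
# An explicit user choice or a missing config produces no override.
assert _detect_modelopt_method({"quant_algo": "FP8"}, "modelopt_fp8") is None
assert _detect_modelopt_method(None, "modelopt") is None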
Comment on lines +557 to +569
def _is_already_quantized(self) -> bool:
    """Check if the model is already quantized based on config files."""
    # Check for HuggingFace quantization config
    if is_remote_url(self.model_path):
        try:
            from huggingface_hub import HfApi

            hf_api = HfApi()
            return hf_api.file_exists(self.model_path, "hf_quant_config.json")
        except Exception:
            return False
    else:
        return os.path.exists(os.path.join(self.model_path, "hf_quant_config.json"))
Collaborator


This basically detects whether "hf_quant_config.json" exists in the model directory. IIRC there are similar helper functions already. Could you try to find and reuse one? (If not, add one to "utils.py" and call it here.)
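
A minimal sketch of what that shared helper could look like (hypothetical name and location, reusing the is_remote_url helper from the quoted hunk):

import os


def has_hf_quant_config(model_path: str) -> bool:
    """Return True if the model ships an hf_quant_config.json, locally or on the Hub."""
    if is_remote_url(model_path):
        try:
            from huggingface_hub import HfApi

            return HfApi().file_exists(model_path, "hf_quant_config.json")
        except Exception:
            return False
    return os.path.exists(os.path.join(model_path, "hf_quant_config.json"))

With that in place, _is_already_quantized() would reduce to return has_hf_quant_config(self.model_path).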

Comment on lines +1877 to +1892
# Export model if path provided
if export_path:
    try:
        # Get the original model path from the model config
        original_model_path = getattr(self, "_original_model_path", None)
        self._export_modelopt_checkpoint(
            model, export_path, original_model_path
        )
        rank0_log(
            f"Quantized model exported to HuggingFace format at {export_path}"
        )
    except Exception as e:
        rank0_log(
            f"Warning: Failed to export quantized model to {export_path}: {e}"
        )

Collaborator


DRY: move to a helper function, like

def _maybe_export(self, model, export_path):
    if not export_path:
        return
    try:
        # Get the original model path from the model config
        original_model_path = getattr(self, "_original_model_path", None)
        self._export_modelopt_checkpoint(model, export_path, original_model_path)
        rank0_log(
            f"Quantized model exported to HuggingFace format at {export_path}"
        )
    except Exception as e:
        rank0_log(
            f"Warning: Failed to export quantized model to {export_path}: {e}"
        )

Collaborator


Not sure whether we could reuse the loader to export the model... cc @merrymercy

Collaborator Author


I don't quite understand this comment, but let's discuss in our call tomorrow.
