Conversation


@Edwardf0t1 Edwardf0t1 commented Sep 8, 2025

This is the third PR in a three-part series to enable native ModelOpt quantization in SGLang. It includes changes from the first PR (#7149) and second PR (#9991) and will be rebased once the first two PRs are merged.

Motivation

We aim to enhance SGLang's quantization capabilities, making ModelOpt integration more robust and user-friendly while adding quantized-checkpoint persistence, so a model can be quantized and exported once and then served repeatedly in production without re-quantizing.

Modifications

  • Integrated ModelOpt quantized-model export functionality.
  • Added a modelopt_export_path parameter to _setup_modelopt_quantization() in ModelOptModelLoader.
  • Implemented the _export_modelopt_checkpoint() method using ModelOpt's unified HF export API (see the sketch after this list).
  • Added a modelopt_export_path field to ModelConfig and a --modelopt-export-path command-line argument to ServerArgs.
  • Export happens automatically after quantization (or when restoring from a checkpoint).
  • Added unit tests for the export functionality.
  • Unified the quantization flags across the quantize + export and deployment phases.
  • Added an example script that runs ModelOpt quantize + export + deployment.
  • TODO: Enable a quantize-and-serve mode that performs quantize + export + deployment with a single command.
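
For reference, a minimal sketch of what the export step could look like. It assumes ModelOpt's unified HF export entry point is modelopt.torch.export.export_hf_checkpoint with an export_dir argument; the helper name mirrors the one added in this PR, but the body is illustrative only:

import os

from modelopt.torch.export import export_hf_checkpoint  # assumed ModelOpt unified HF export API


def _export_modelopt_checkpoint(model, export_path, original_model_path=None):
    """Export a ModelOpt-quantized model to a HuggingFace-format checkpoint (sketch)."""
    os.makedirs(export_path, exist_ok=True)
    # Writes the quantized weights plus hf_quant_config.json, so the directory
    # can later be served directly with --quantization modelopt.
    export_hf_checkpoint(model, export_dir=export_path)
    # original_model_path could be used to copy tokenizer/config files alongside
    # the weights; omitted here for brevity.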

Accuracy Tests

Production Workflow:

# Step 1: Quantize + Export
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# Step 2: Deploy
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt
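
The exported checkpoint can also be served from Python. A minimal sketch using SGLang's offline Engine API, assuming the export directory from Step 1 and the same quantization="modelopt" setting:

import sglang as sgl

# Load the exported FP8 checkpoint produced in Step 1.
llm = sgl.Engine(
    model_path="./quantized_tinyllama_fp8",
    quantization="modelopt",
)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, {"temperature": 0.0, "max_new_tokens": 16})
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()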

Benchmarking and Profiling

Checklist

Summary by CodeRabbit

  • New Features

    • Added NVIDIA ModelOpt quantization support (FP8/FP4 auto-detection), export to Hugging Face format, and serving of exported models.
    • Introduced CLI options to export after quantization and to quantize-and-serve.
    • Added quantization choice: modelopt_fp8.
    • Included an example script demonstrating quantize, export, and deploy.
  • Documentation

    • New guide “Using NVIDIA ModelOpt” covering installation, workflow, Python usage, deployment, and advanced features; reference updated.
  • Tests

    • Expanded coverage for ModelOpt workflows and additional model/attention components.
  • Chores

    • Added optional dependency group for ModelOpt.

@Edwardf0t1
Collaborator Author

@zhyncs @Qiaolin-Yu Please help review this PR, or find someone who can, when you get a chance. Thank you!

@Qiaolin-Yu Qiaolin-Yu self-assigned this Sep 13, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 19fcedb to 95fc54b Compare September 13, 2025 01:48
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 95fc54b to d25e5d1 Compare September 23, 2025 08:18

[project.optional-dependencies]
decord = ["decord"]
modelopt = ["nvidia-modelopt"]
Collaborator Author

@Edwardf0t1 Edwardf0t1 Sep 26, 2025


@zhyncs @Qiaolin-Yu Please let us know whether it's okay to add modelopt as an optional dependency, or whether it should be a required dependency.

cc @Ying1123 @merrymercy

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch 2 times, most recently from c5181b3 to 15dd13e Compare September 30, 2025 05:34
@b8zhong b8zhong added the run-ci label Oct 6, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 15dd13e to 9c2eaac Compare October 8, 2025 08:06
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/modelopt-sglang-api-3 branch from 7b27705 to 456a3f9 Compare October 14, 2025 08:24
@JustinTong0323 JustinTong0323 self-assigned this Oct 14, 2025
Comment on lines +114 to +130
@classmethod
def override_quantization_method(cls, hf_quant_config, user_quant):
    """Override quantization method based on the model's config."""
    if hf_quant_config is None:
        return None

    # Check if this is a ModelOpt config
    quant_algo = hf_quant_config.get("quant_algo", "").upper()

    # If user specified generic "modelopt", auto-detect the specific method
    if user_quant == "modelopt":
        if "FP8" in quant_algo:
            return "modelopt_fp8"
        elif "NVFP4" in quant_algo or "FP4" in quant_algo:
            return "modelopt_fp4"

    return None
Collaborator


Why do we have the exact same code duplicated here?

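
For context, a standalone illustration of how the auto-detection above behaves. The config dicts are hypothetical but match the shape the method reads (a top-level quant_algo key); the function is a copy of the quoted logic, not the class method itself:

def _detect_modelopt_method(hf_quant_config, user_quant):
    """Standalone copy of the auto-detection logic above, for illustration only."""
    if hf_quant_config is None:
        return None
    quant_algo = hf_quant_config.get("quant_algo", "").upper()
    if user_quant == "modelopt":
        if "FP8" in quant_algo:
            return "modelopt_fp8"
        elif "NVFP4" in quant_algo or "FP4" in quant_algo:
            return "modelopt_fp4"
    return None


assert _detect_modelopt_method({"quant_algo": "FP8"}, "modelopt") == "modelopt_fp8"
assert _detect_modelopt_method({"quant_algo": "NVFP4"}, "modelopt") == "modelopt_fp4"
# An explicit user choice or a missing config produces no override.
assert _detect_modelopt_method({"quant_algo": "FP8"}, "modelopt_fp8") is None
assert _detect_modelopt_method(None, "modelopt") is None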
Comment on lines +557 to +569
def _is_already_quantized(self) -> bool:
    """Check if the model is already quantized based on config files."""
    # Check for HuggingFace quantization config
    if is_remote_url(self.model_path):
        try:
            from huggingface_hub import HfApi

            hf_api = HfApi()
            return hf_api.file_exists(self.model_path, "hf_quant_config.json")
        except Exception:
            return False
    else:
        return os.path.exists(os.path.join(self.model_path, "hf_quant_config.json"))
Collaborator


This basically detects whether "hf_quant_config.json" exists in the model directory. IIRC there are similar helper functions already. Could you try to find and reuse one? (If not, add one to "utils.py" and call it here.)
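
A minimal sketch of what that shared helper could look like (hypothetical name and location, reusing the is_remote_url helper from the quoted hunk):

import os


def has_hf_quant_config(model_path: str) -> bool:
    """Return True if the model ships an hf_quant_config.json, locally or on the Hub."""
    if is_remote_url(model_path):
        try:
            from huggingface_hub import HfApi

            return HfApi().file_exists(model_path, "hf_quant_config.json")
        except Exception:
            return False
    return os.path.exists(os.path.join(model_path, "hf_quant_config.json"))

With that in place, _is_already_quantized() would reduce to return has_hf_quant_config(self.model_path).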

Comment on lines +1877 to +1892
# Export model if path provided
if export_path:
    try:
        # Get the original model path from the model config
        original_model_path = getattr(self, "_original_model_path", None)
        self._export_modelopt_checkpoint(
            model, export_path, original_model_path
        )
        rank0_log(
            f"Quantized model exported to HuggingFace format at {export_path}"
        )
    except Exception as e:
        rank0_log(
            f"Warning: Failed to export quantized model to {export_path}: {e}"
        )

Collaborator


DRY: move to a helper function, like

def _maybe_export(self, model, export_path):
    if not export_path:
        return
    try:
        # Get the original model path from the model config
        original_model_path = getattr(self, "_original_model_path", None)
        self._export_modelopt_checkpoint(model, export_path, original_model_path)
        rank0_log(
            f"Quantized model exported to HuggingFace format at {export_path}"
        )
    except Exception as e:
        rank0_log(
            f"Warning: Failed to export quantized model to {export_path}: {e}"
        )

Collaborator


Not sure whether we could reuse the loader to export the model... cc @merrymercy

Collaborator Author


I don't quite understand this comment, but let's discuss in our call tomorrow.
