
Commit c589e33
Merge branch 'main' into patch-3
2 parents: bcab0c5 + 6130529


146 files changed: 6,141 additions, 3,787 deletions


.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 /python/sglang/srt/eplb @fzyzcjy
 /python/sglang/srt/function_call @CatherineSue @JustinTong0323
 /python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan @BBuf @kushanam @Edwardf0t1
-/python/sglang/srt/layers/attention @ping1jing2
+/python/sglang/srt/layers/attention/ascend_backend.py @ping1jing2
 /python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang
 /python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann

.github/workflows/ci-monitor.yml

Lines changed: 6 additions & 3 deletions
@@ -2,8 +2,7 @@ name: CI Monitor

 on:
   schedule:
-    # Run every 6 hours at 00:00, 06:00, 12:00, 18:00 UTC
-    - cron: '0 */6 * * *'
+    - cron: '0 */12 * * *'
   workflow_dispatch:
     inputs:
       limit:
@@ -16,6 +15,10 @@ concurrency:
   group: ci-monitor-${{ github.ref }}
   cancel-in-progress: true

+permissions:
+  contents: write
+  actions: read
+
 jobs:
   ci-monitor:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
@@ -50,7 +53,7 @@ jobs:
           PYTHONIOENCODING: utf-8
         run: |
           cd scripts/ci_monitor
-          python ci_analyzer_perf.py --token $GITHUB_TOKEN --limit 500 --output-dir performance_tables_$(date +%Y%m%d_%H%M%S)
+          python ci_analyzer_perf.py --token $GITHUB_TOKEN --limit ${{ github.event.inputs.limit || '1000' }} --output-dir performance_tables_$(date +%Y%m%d_%H%M%S) --upload-to-github

       - name: Upload Analysis Results
         uses: actions/upload-artifact@v4
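For reference, a minimal sketch of how the monitor could still be run on demand between the new 12-hour scheduled runs, assuming the GitHub CLI is installed and authorized for the repository (the workflow file name and the `limit` input come from the diff above; everything else is illustrative):

  # Trigger the CI monitor manually with a custom analysis window.
  # Requires an authenticated gh CLI with permission to dispatch workflows.
  gh workflow run ci-monitor.yml \
    --repo sgl-project/sglang \
    -f limit=500   # when omitted, the run falls back to 1000 per the expression above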

.github/workflows/pr-test.yml

Lines changed: 45 additions & 27 deletions
@@ -99,8 +99,6 @@ jobs:
     needs: [check-changes, sgl-kernel-build-wheels]
     if: needs.check-changes.outputs.sgl_kernel == 'true'
     runs-on: 1-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - uses: actions/checkout@v4

@@ -155,6 +153,50 @@
           cd test/srt
           python3 test_mla_deepseek_v3.py

+  sgl-kernel-benchmark-test:
+    needs: [check-changes, sgl-kernel-build-wheels]
+    if: always() && !failure() && !cancelled()
+    runs-on: 1-gpu-runner
+    env:
+      HF_TOKEN: ${{ secrets.HF_TOKEN }}
+      CI: true
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Cleanup
+        run: |
+          ls -alh sgl-kernel/dist || true
+          rm -rf sgl-kernel/dist/* || true
+
+      - name: Download artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: sgl-kernel/dist/
+          merge-multiple: true
+          pattern: wheel-python3.10-cuda12.9
+
+      - name: Install dependencies
+        run: |
+          CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/ci_install_dependency.sh
+
+      - name: Run benchmark tests
+        timeout-minutes: 45
+        run: |
+          cd sgl-kernel/benchmark
+          echo "Running sgl-kernel benchmark tests in CI mode..."
+
+          echo "CI environment variable: $CI"
+          echo "GITHUB_ACTIONS environment variable: $GITHUB_ACTIONS"
+
+          for bench_file in bench_*.py; do
+            echo "Testing $bench_file..."
+            timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..."
+            echo "Completed $bench_file"
+            echo "---"
+          done
+
+          echo "All benchmark tests completed!"
+
 # =============================================== primary ====================================================

   unit-test-frontend:
@@ -189,8 +231,6 @@ jobs:
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 1-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     strategy:
       fail-fast: false
       matrix:
@@ -222,8 +262,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 2-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     strategy:
       fail-fast: false
       matrix:
@@ -255,8 +293,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 4-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     strategy:
       fail-fast: false
       matrix:
@@ -288,8 +324,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 8-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     strategy:
       fail-fast: false
       matrix:
@@ -321,8 +355,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 1-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -382,8 +414,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 1-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -435,8 +465,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 2-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -494,8 +522,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 1-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -526,8 +552,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 2-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -558,8 +582,6 @@
     if: always() && !failure() && !cancelled() &&
       ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 4-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -587,8 +609,6 @@
     if: always() && !failure() && !cancelled() &&
      ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 8-gpu-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -616,8 +636,6 @@
     if: always() && !failure() && !cancelled() &&
      ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
     runs-on: 4-b200-runner
-    env:
-      HF_TOKEN: ${{ secrets.HF_TOKEN }}
     strategy:
       fail-fast: false
     steps:
@@ -647,7 +665,7 @@
       check-changes,

       sgl-kernel-build-wheels,
-      sgl-kernel-unit-test, sgl-kernel-mla-test,
+      sgl-kernel-unit-test, sgl-kernel-mla-test, sgl-kernel-benchmark-test,

       unit-test-frontend, unit-test-backend-1-gpu,
       unit-test-backend-2-gpu, unit-test-backend-4-gpu, unit-test-backend-8-gpu,
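The new `sgl-kernel-benchmark-test` job boils down to a tolerant sweep over the benchmark scripts. A rough local reproduction, under the assumption that you have a GPU machine with the repository checked out and the sgl-kernel wheel already installed, could look like this:

  # Approximate what the CI job does: run every benchmark script briefly,
  # tolerating individual failures so one bad script does not stop the sweep.
  cd sgl-kernel/benchmark
  export CI=true   # the benchmark scripts are expected to shorten their runs in CI mode
  for bench_file in bench_*.py; do
    echo "Testing $bench_file..."
    timeout 60 python3 "$bench_file" || echo "Warning: $bench_file timed out or failed, continuing..."
  done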

docker/Dockerfile.rocm

Lines changed: 90 additions & 3 deletions
@@ -69,6 +69,13 @@ ARG LLVM_COMMIT="6520ace8227ffe2728148d5f3b9872a870b0a560"
 ARG MOONCAKE_REPO="https://github.com/kvcache-ai/Mooncake.git"
 ARG MOONCAKE_COMMIT="dcdf1c784b40aa6975a8ed89fe26321b028e40e8"

+ARG TILELANG_REPO="https://github.com/HaiShaw/tilelang.git"
+ARG TILELANG_BRANCH="dsv32-mi35x"
+ARG TILELANG_COMMIT="ae938cf885743f165a19656d1122ad42bb0e30b8"
+
+ARG FHT_REPO="https://github.com/jeffdaily/fast-hadamard-transform.git"
+ARG FHT_BRANCH="rocm"
+ARG FHT_COMMIT="46efb7d776d38638fc39f3c803eaee3dd7016bd1"
 USER root

 # Install some basic utilities
@@ -90,8 +97,6 @@ RUN if [ "$BUILD_LLVM" = "1" ]; then \
     && make -j$(nproc); \
     fi

-# -----------------------
-
 # -----------------------
 # AITER
 RUN pip uninstall -y aiter
@@ -155,7 +160,6 @@ RUN if [ "$BUILD_MOONCAKE" = "1" ]; then \
     make -j "$(nproc)" && make install; \
     fi

-
 # -----------------------
 # Build SGLang
 ARG BUILD_TYPE=all
@@ -207,6 +211,89 @@ RUN python3 -m pip install --no-cache-dir setuptools-rust \
     && python3 -m pip install --no-cache-dir . \
     && rm -rf /root/.cache

+# -----------------------
+# TileLang
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LIBGL_ALWAYS_INDIRECT=1
+RUN echo "LC_ALL=en_US.UTF-8" >> /etc/environment
+
+RUN /bin/bash -lc 'set -euo pipefail; \
+    # Build TileLang only for gfx950
+    if [ "${GPU_ARCH:-}" != "gfx950" ]; then \
+        echo "[TileLang] Skipping (GPU_ARCH=${GPU_ARCH:-unset})"; \
+        exit 0; \
+    fi; \
+    echo "[TileLang] Building TileLang for ${GPU_ARCH}"; \
+    \
+    # System dependencies (NO llvm-dev to avoid llvm-config-16 shadowing)
+    apt-get update && apt-get install -y --no-install-recommends \
+        build-essential git wget curl ca-certificates gnupg \
+        libgtest-dev libgmock-dev \
+        libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \
+        python3 python3-dev python3-setuptools python3-pip \
+        gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev \
+        cmake ninja-build pkg-config libstdc++6 \
+        && rm -rf /var/lib/apt/lists/*; \
+    \
+    # Build GoogleTest static libs (Ubuntu package ships sources only)
+    cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \
+    cmake --build /tmp/build-gtest -j"$(nproc)" && \
+    cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \
+    rm -rf /tmp/build-gtest; \
+    \
+    # Keep setuptools < 80 (compat with base image)
+    python3 -m pip install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja && \
+    python3 -m pip cache purge || true; \
+    \
+    # Locate ROCm llvm-config; fallback to installing LLVM 18 if missing
+    LLVM_CONFIG_PATH=""; \
+    for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do \
+        if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi; \
+    done; \
+    if [ -z "$LLVM_CONFIG_PATH" ]; then \
+        echo "[TileLang] ROCm llvm-config not found; installing LLVM 18..."; \
+        curl -fsSL https://apt.llvm.org/llvm.sh -o /tmp/llvm.sh; \
+        chmod +x /tmp/llvm.sh; \
+        /tmp/llvm.sh 18; \
+        LLVM_CONFIG_PATH="$(command -v llvm-config-18)"; \
+        if [ -z "$LLVM_CONFIG_PATH" ]; then echo "ERROR: llvm-config-18 not found after install"; exit 1; fi; \
+    fi; \
+    echo "[TileLang] Using LLVM_CONFIG at: $LLVM_CONFIG_PATH"; \
+    export PATH="$(dirname "$LLVM_CONFIG_PATH"):/usr/local/bin:${PATH}"; \
+    export LLVM_CONFIG="$LLVM_CONFIG_PATH"; \
+    \
+    # Optional shim for tools that expect llvm-config-16
+    mkdir -p /usr/local/bin && \
+    printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16 && \
+    chmod +x /usr/local/bin/llvm-config-16; \
+    \
+    # TVM Python bits need Cython
+    python3 -m pip install --no-cache-dir "cython>=0.29.36,<3.0"; \
+    \
+    # Clone + pin TileLang (bundled TVM), then build
+    git clone --recursive --branch "${TILELANG_BRANCH}" "${TILELANG_REPO}" /opt/tilelang && \
+    cd /opt/tilelang && \
+    git fetch --depth=1 origin "${TILELANG_COMMIT}" || true && \
+    git checkout -f "${TILELANG_COMMIT}" && \
+    git submodule update --init --recursive && \
+    export CMAKE_ARGS="-DLLVM_CONFIG=${LLVM_CONFIG} ${CMAKE_ARGS:-}" && \
+    bash ./install_rocm.sh'
+
+# -----------------------
+# Hadamard-transform (HIP build)
+RUN /bin/bash -lc 'set -euo pipefail; \
+    git clone --branch "${FHT_BRANCH}" "${FHT_REPO}" fast-hadamard-transform; \
+    cd fast-hadamard-transform; \
+    git checkout -f "${FHT_COMMIT}"; \
+    python setup.py install'
+
+# -----------------------
+# Python tools
+RUN python3 -m pip install --no-cache-dir \
+    py-spy \
+    pre-commit
+
+# -----------------------
 # Performance environment variable.
 ENV HIP_FORCE_DEV_KERNARG=1
 ENV HSA_NO_SCRATCH_RECLAIM=1
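Since the TileLang stage above only runs when GPU_ARCH is gfx950, a build targeting MI35x-class GPUs might be invoked roughly as sketched below. The image tag is made up, and it assumes GPU_ARCH is an existing build argument of this Dockerfile (the diff only shows its use inside the new RUN step):

  # Hypothetical build invocation; adjust the tag and build args to your environment.
  docker build \
    -f docker/Dockerfile.rocm \
    --build-arg GPU_ARCH=gfx950 \
    -t sglang-rocm:tilelang-gfx950 \
    .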

docs/advanced_features/hyperparameter_tuning.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ The case of a server being too conservative can happen when users send many requ

 On the other hand, if you see `token usage` very high and you frequently see warnings like
 `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
-If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay.
+If you see `KV cache pool is full. Retract requests.` occasionally but not frequently (~1 time per minute), it is okay.

 ### Tune `--mem-fraction-static` to increase KV cache pool capacity
 SGLang allocates memory as follows:
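As a concrete illustration of the advice above (the model path is a placeholder; the flag name comes from the documentation being edited), raising the scheduler's conservativeness when retractions are frequent might look like:

  # Hypothetical launch command; --schedule-conservativeness is the only flag of interest here.
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --schedule-conservativeness 1.3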

docs/advanced_features/server_arguments.md

Lines changed: 1 addition & 0 deletions
@@ -113,6 +113,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--quantization` | The quantization method. | None |
 | `--quantization-param-path` | Path to the JSON file containing the KV cache scaling factors. This should generally be supplied, when KV cache dtype is FP8. Otherwise, KV cache scaling factors default to 1.0, which may cause accuracy issues. | None |
 | `--kv-cache-dtype` | Data type for kv cache storage. 'auto' will use model data type. 'fp8_e5m2' and 'fp8_e4m3' is supported for CUDA 11.8+. | auto |
+| `--enable-fp32-lm-head` | If set, the LM head outputs (logits) are in FP32. | False |

 ## Memory and scheduling
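For example, the newly documented flag would be passed at launch time like any other boolean server argument; a minimal sketch, with the model path as a placeholder and the flag combination purely illustrative:

  # Hypothetical example combining the documented KV-cache and LM-head options.
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8_e5m2 \
    --enable-fp32-lm-head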

docs/references/faq.md

Lines changed: 10 additions & 6 deletions
@@ -9,14 +9,20 @@ If you encounter out-of-memory (OOM) errors, you can adjust the following parame

 - If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
 - If OOM occurs during decoding, try lowering `--max-running-requests`.
-- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
+- You can also decrease `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
 - Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.

 ### CUDA Error: Illegal Memory Access Encountered
 This error may result from kernel errors or out-of-memory issues:
 - If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
 - If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.

+### The server hangs
+- If the server hangs during initialization or running, it can be memory issues (out of memory), network issues (nccl errors), or other bugs in sglang.
+- If it is out of memory, you might see that `avail mem` is very low during the initialization or right after initialization. In this case,
+you can try to decrease `--mem-fraction-static`, decrease `--cuda-graph-max-bs`, or decrease `--chunked-prefill-size`.
+- Other bugs, please raise a Github issue to us.
+

 ## Frequently Asked Questions

@@ -28,8 +34,6 @@ From our initial investigation, this indeterminism arises from two factors: dyna

 To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.

-We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
-
-We have two issues to track our progress:
-- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
-- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
+**Note**:
+Recently, we also introduced a deterministic mode, you can enable it with `--enable-deterministic-inference`. It might not work for all cases.
+Please find more details in this blog post: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
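Putting the updated FAQ advice into one command line (the model path and numeric values are illustrative; the flags themselves appear in the text above):

  # Hypothetical launch that applies the OOM and determinism guidance from the FAQ.
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --mem-fraction-static 0.7 \
    --chunked-prefill-size 2048 \
    --enable-deterministic-inference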
