Commit 2f2fe97 (1 parent: 8e3dec3)

[cherry-pick][serve][docs] ray serve llm example (#56340)

Cherry-pick of #55819 and #56287.

52 files changed: +5,018 −6 lines

.vale/styles/config/vocabularies/General/accept.txt

Lines changed: 9 additions & 1 deletion

@@ -33,6 +33,7 @@ autoscales
 bool
 breakpoint
 BTS
+bursty
 chatbot
 CLI
 configs
@@ -45,8 +46,10 @@ deserialize
 deserializes
 dev
 dev to prod
-disable
+[d|D]isable[d]
+[d|D]isable
 DLinear
+Dockerfile
 DPO
 EKS
 ETDataset
@@ -69,13 +72,15 @@ LMs
 LSH
 MCP
 Megatron
+Mixtral
 MLflow
 MLOps
 namespace
 NER
 Nsight
 NumPy
 NVIDIA
+NVLink
 OOM
 open-source
 PACK
@@ -86,6 +91,8 @@ pretraining
 productionize
 Pythonic
 QPS
+Qwen
+Quantizing
 retrigger
 RISECamp
 RLHF
@@ -104,6 +111,7 @@ teardown
 uncaptured
 URI(s)?
 UUID
+USD
 uv
 verl
 VM(s)?
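
Swapping the plain "disable" entry for "[d|D]isable[d]" and "[d|D]isable" works because Vale treats each accept.txt line as a case-sensitive regular expression, as the existing "URI(s)?" and "VM(s)?" entries already show. A quick sanity check of what the two new patterns accept, sketched with Python's re module (illustrative only; note that inside a character class "|" is a literal pipe, so "[dD]" would be the tighter spelling):

import re

# Vale reads each accept.txt entry as a case-sensitive regex.
# Inside a character class, "|" matches a literal pipe, so [d|D]
# behaves like [dD] here (plus a never-occurring "|isable").
patterns = [r"[d|D]isable[d]", r"[d|D]isable"]
words = ["disable", "Disable", "disabled", "Disabled"]

for pattern in patterns:
    matched = [w for w in words if re.fullmatch(pattern, w)]
    print(f"{pattern!r} matches {matched}")
# '[d|D]isable[d]' matches ['disabled', 'Disabled']
# '[d|D]isable' matches ['disable', 'Disable']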

doc/BUILD.bazel

Lines changed: 6 additions & 0 deletions

@@ -606,3 +606,9 @@ filegroup(
     srcs = glob(["source/ray-overview/examples/**/*.yaml"]),
     visibility = ["//release:__pkg__"],
 )
+
+filegroup(
+    name = "deployment_serve_llm_example_configs",
+    srcs = glob(["source/serve/tutorials/deployment-serve-llm/**/*.yaml"]),
+    visibility = ["//release:__pkg__"],
+)

doc/source/conf.py

Lines changed: 2 additions & 0 deletions

@@ -228,6 +228,8 @@ def __init__(self, version: str):
     "data/api/ray.data.*.rst",
     "ray-overview/examples/**/README.md",  # Exclude .md files in examples subfolders
     "train/examples/**/README.md",
+    "serve/tutorials/deployment-serve-llm/README.*",
+    "serve/tutorials/deployment-serve-llm/*/notebook.ipynb",
 ] + autogen_files

 # If "DOC_LIB" is found, only build that top-level navigation item.

doc/source/serve/examples.yml

Lines changed: 48 additions & 0 deletions

@@ -74,6 +74,54 @@ examples:
       - natural language processing
     link: tutorials/serve-deepseek
     related_technology: llm applications
+  - title: Deploying a small-sized LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/small-size-llm/README
+    related_technology: llm applications
+  - title: Deploying a medium-sized LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/medium-size-llm/README
+    related_technology: llm applications
+  - title: Deploying a large-sized LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/large-size-llm/README
+    related_technology: llm applications
+  - title: Deploying a vision LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/vision-llm/README
+    related_technology: llm applications
+  - title: Deploying a reasoning LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/reasoning-llm/README
+    related_technology: llm applications
+  - title: Deploying a hybrid reasoning LLM
+    skill_level: beginner
+    use_cases:
+      - generative ai
+      - large language models
+      - natural language processing
+    link: tutorials/deployment-serve-llm/hybrid-reasoning-llm/README
+    related_technology: llm applications
   - title: Serve a Chatbot with Request and Response Streaming
     skill_level: intermediate
     use_cases:
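
Every new gallery entry has the same shape: a title, a skill level, a list of use cases, a link to the tutorial README, and a related technology. As a quick illustration of that structure (not code from this commit; the helper name and filter criteria are invented for the example), PyYAML can load and query the gallery:

import yaml  # pip install pyyaml

def beginner_llm_examples(path="doc/source/serve/examples.yml"):
    """List (title, link) pairs for beginner-level LLM gallery entries.

    Assumes the structure shown in the diff above: a top-level
    "examples" list whose items carry title/skill_level/link fields.
    """
    with open(path) as f:
        examples = yaml.safe_load(f)["examples"]
    return [
        (entry["title"], entry["link"])
        for entry in examples
        if entry.get("skill_level") == "beginner"
        and entry.get("related_technology") == "llm applications"
    ]

for title, link in beginner_llm_examples():
    print(f"{title}: {link}")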

doc/source/serve/tutorials/BUILD.bazel

Lines changed: 0 additions & 5 deletions
This file was deleted.
Lines changed: 58 additions & 0 deletions (new file: notebook.ipynb)

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bc12c0d2",
   "metadata": {},
   "source": [
    "# Quickstarts for LLM serving\n",
    "\n",
    "These guides provide a fast path to serving LLMs using Ray Serve on Anyscale, with focused tutorials for different deployment scales, from single-GPU setups to multi-node clusters.\n",
    "\n",
    "Each tutorial includes development and production setups, tips for configuring your cluster, and guidance on monitoring and scaling with Ray Serve.\n",
    "\n",
    "## Tutorial categories\n",
    "\n",
    "**[Small-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html)** \n",
    "Deploy small-sized models on a single GPU, such as Llama 3 8&nbsp;B, Mistral 7&nbsp;B, or Phi-2.\n",
    "\n",
    "---\n",
    "\n",
    "**[Medium-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/medium-size-llm/README.html)** \n",
    "Deploy medium-sized models using tensor parallelism across 4-8 GPUs on a single node, such as Llama 3 70&nbsp;B, Qwen 14&nbsp;B, or Mixtral 8x7&nbsp;B.\n",
    "\n",
    "---\n",
    "\n",
    "**[Large-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html)** \n",
    "Deploy massive models using pipeline parallelism across a multi-node cluster, such as DeepSeek-R1 or Llama-Nemotron-253&nbsp;B.\n",
    "\n",
    "---\n",
    "\n",
    "**[Vision LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/vision-llm/README.html)** \n",
    "Deploy models with image and text input, such as Qwen 2.5-VL-7&nbsp;B-Instruct, MiniGPT-4, or Pixtral-12&nbsp;B.\n",
    "\n",
    "---\n",
    "\n",
    "**[Reasoning LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/reasoning-llm/README.html)** \n",
    "Deploy models with reasoning capabilities designed for long-context tasks, coding, or tool use, such as QwQ-32&nbsp;B.\n",
    "\n",
    "---\n",
    "\n",
    "**[Hybrid thinking LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/hybrid-reasoning-llm/README.html)** \n",
    "Deploy models that can switch between reasoning and non-reasoning modes for flexible usage, such as Qwen-3."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "myst": {
   "front_matter": {
    "orphan": true
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Lines changed: 41 additions & 0 deletions (new file: README.md)

<!--
Do not modify this README. This file is a copy of the notebook and is not used to display the content.
Modify notebook.ipynb instead, then regenerate this file with:
jupyter nbconvert "$notebook.ipynb" --to markdown --output "README.md"
-->

# Quickstarts for LLM serving

These guides provide a fast path to serving LLMs using Ray Serve on Anyscale, with focused tutorials for different deployment scales, from single-GPU setups to multi-node clusters.

Each tutorial includes development and production setups, tips for configuring your cluster, and guidance on monitoring and scaling with Ray Serve.

## Tutorial categories

**[Small-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html)**
Deploy small-sized models on a single GPU, such as Llama 3 8&nbsp;B, Mistral 7&nbsp;B, or Phi-2.

---

**[Medium-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/medium-size-llm/README.html)**
Deploy medium-sized models using tensor parallelism across 4-8 GPUs on a single node, such as Llama 3 70&nbsp;B, Qwen 14&nbsp;B, or Mixtral 8x7&nbsp;B.

---

**[Large-sized LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html)**
Deploy massive models using pipeline parallelism across a multi-node cluster, such as DeepSeek-R1 or Llama-Nemotron-253&nbsp;B.

---

**[Vision LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/vision-llm/README.html)**
Deploy models with image and text input, such as Qwen 2.5-VL-7&nbsp;B-Instruct, MiniGPT-4, or Pixtral-12&nbsp;B.

---

**[Reasoning LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/reasoning-llm/README.html)**
Deploy models with reasoning capabilities designed for long-context tasks, coding, or tool use, such as QwQ-32&nbsp;B.

---

**[Hybrid thinking LLM deployment](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/hybrid-reasoning-llm/README.html)**
Deploy models that can switch between reasoning and non-reasoning modes for flexible usage, such as Qwen-3.
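
For orientation, the app these tutorials build looks roughly like the following single-GPU deployment. This is a minimal sketch with the ray.serve.llm API, not code from this commit; the model ID, accelerator type, and autoscaling bounds are illustrative placeholders.

# serve_llm.py -- minimal sketch of a single-GPU LLM deployment.
# Placeholders: model names, accelerator type, autoscaling bounds.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llama",  # name clients address the model by
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # HF repo
    },
    accelerator_type="L4",  # a single small GPU
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
)

# Expose an OpenAI-compatible app (/v1/chat/completions, /v1/models).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)

Once running, any OpenAI-compatible client pointed at the Serve HTTP endpoint (port 8000 by default) can query the model under the name "my-llama".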
Lines changed: 14 additions & 0 deletions (new file: AWS compute config)

cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

# Head node
head_node_type:
  name: head
  instance_type: m5.2xlarge
  resources:
    cpu: 8

# Worker nodes
auto_select_worker_config: true
flags:
  allow-cross-zone-autoscaling: true
Lines changed: 3 additions & 0 deletions (new file: setup script)

#!/bin/bash

set -exo pipefail
Lines changed: 14 additions & 0 deletions (new file: GCP compute config)

cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-central1

# Head node
head_node_type:
  name: head
  instance_type: n2-standard-8
  resources:
    cpu: 8

# Worker nodes
auto_select_worker_config: true
flags:
  allow-cross-zone-autoscaling: true
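
The {{env["ANYSCALE_CLOUD_ID"]}} value in both compute configs is a template placeholder that release tooling fills from the environment before submitting the config. A rough stand-in for that substitution step (illustrative only; the file name and the assumption that only env[...] placeholders occur are mine):

import os
import re

def render(text: str) -> str:
    """Fill {{env["VAR"]}} placeholders from environment variables."""
    return re.sub(
        r'\{\{env\["([A-Za-z_]+)"\]\}\}',
        lambda match: os.environ[match.group(1)],
        text,
    )

with open("compute_config.yaml") as f:  # hypothetical file name
    print(render(f.read()))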
