falconlee236 (Contributor)

Add complete guide for vLLM Production Stack deployment on GKE with Terraform

This PR adds comprehensive documentation for deploying a GPU-accelerated vLLM Production Stack on Google Kubernetes Engine (GKE) using Terraform. The implementation creates a production-ready infrastructure with specialized node pools for ML workloads and management services.

Key Features

  • Complete GKE Infrastructure: Configures a GKE cluster with regular release channel, comprehensive logging/monitoring, VPC-native networking, and managed Prometheus
  • Specialized Node Pools:
    • GPU-accelerated nodes with NVIDIA L4 GPUs (g2-standard-8 instances)
    • Cost-effective management nodes (e2-standard-4 instances)
  • vLLM Stack Deployment: Includes NVIDIA Device Plugin and vLLM with OpenAI-compatible API endpoints
  • Automated Deployment: Makefile with commands for easy deployment and cleanup
  • Comprehensive Testing: Includes instructions for testing model availability and running inference
  • Troubleshooting Guide: Detailed troubleshooting section with helpful kubectl commands
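As a rough illustration of the kind of GPU node pool that gke-infrastructure/node_pools.tf would define, a sketch follows; the resource names, cluster reference, and autoscaling bounds are assumptions for illustration, not the PR's actual code.

```hcl
# Hypothetical sketch of a GPU node pool; names and values are illustrative.
resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.primary.name
  location = var.region

  node_config {
    machine_type = "g2-standard-8" # G2 machine family pairs with NVIDIA L4

    guest_accelerator {
      type  = "nvidia-l4"
      count = 1
    }

    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }

  autoscaling {
    min_node_count = 0
    max_node_count = 2
  }
}
```

Scaling the pool down to zero nodes when idle is one of the main levers for the cost-management considerations mentioned below.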

Project Structure

gke/
├── credentials.json
├── gke-infrastructure/
│   ├── backend.tf
│   ├── cluster.tf
│   ├── node_pools.tf
│   ├── outputs.tf
│   ├── providers.tf
│   └── variables.tf
├── Makefile
├── production-stack/
│   ├── backend.tf
│   ├── helm.tf
│   ├── production_stack_specification.yaml
│   ├── providers.tf
│   └── variables.tf
└── README.md
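Each of the two stacks keeps its own backend.tf. A minimal sketch of a GCS remote-state backend, assuming a hypothetical bucket name:

```hcl
# Hypothetical GCS backend for Terraform state; bucket name is a placeholder.
terraform {
  backend "gcs" {
    bucket = "my-tf-state-bucket"   # replace with your own bucket
    prefix = "gke-infrastructure"
  }
}
```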

Testing Done

  • Successfully deployed the complete stack on GKE
  • Verified GPU detection and utilization
  • Tested model inference with the facebook/opt-125m model
  • Validated automatic scaling of the infrastructure
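A smoke test against the OpenAI-compatible endpoint can be sketched as follows; the service address is a placeholder for whatever `kubectl get svc` reports in your deployment, and the `curl` calls are commented out because they require the live stack:

```shell
# Placeholder endpoint; substitute the router Service's external IP and port.
VLLM_ENDPOINT="http://localhost:30080"

# Request body for the OpenAI-compatible completions API.
PAYLOAD='{"model": "facebook/opt-125m", "prompt": "The capital of France is", "max_tokens": 16}'
echo "$PAYLOAD"

# Run against a live deployment:
# curl -s "$VLLM_ENDPOINT/v1/models"
# curl -s "$VLLM_ENDPOINT/v1/completions" -H "Content-Type: application/json" -d "$PAYLOAD"
```

The `/v1/models` call verifies model availability; the `/v1/completions` call exercises inference end to end.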

Considerations for Reviewers

  • This implementation requires an increased GPU quota on GCP (an explicit quota-increase request must be submitted)
  • Cost-management guidance is included to help users minimize expenses when the infrastructure is not actively in use

FIX #172

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s with git commit.
  • Classify the PR so the type of change is easy to understand, e.g. [Bugfix], [Feat], or [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation instructions.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of @YuhanLiu11, @Shaoting-Feng, or @ApostaC.

@falconlee236 (Contributor Author)

Hi @Hanchenli @YuhanLiu11

The production-stack GKE Terraform integration is finished just now. Feel free to leave feedback on this PR.

I am curious about this project and about vLLM's projects in general, so I am going to add the Observability stack features in a follow-up after this PR is merged.

I have one more question about this project:

Q: How is the progress on this content going?

> we plan to write a blog on Terraform in the upcoming month so a rough estimate would be very helpful!
> we haven't decided what the specific blog content is. Do you want to get involved in the blog writing as well?

#172 (comment)
#172 (comment)

@YuhanLiu11 (Collaborator)

Hey @falconlee236, thanks so much for your contribution! This is awesome! Will take a look soon.
@Hanchenli, can you update on the plans for the blog on Terraform?

Signed-off-by: falconlee236 <[email protected]>
@falconlee236 (Contributor Author) commented Mar 9, 2025

@YuhanLiu11
I added the observability stack code in production-stack/helm.tf just now, so this earlier comment is resolved:

> I am curious about this project and all about vLLM's projects.
> So I am going to add the Observability stack features after this PR is merged.

@Hanchenli (Collaborator) left a comment
Hi @falconlee236, thank you for submitting this PR. This is AWESOME!!! Could you add to the README some ways to customize the cluster (number of GPUs, GPU type, model type, ...)?

# This command provisions the GKE cluster, node pools, and deploys the vLLM stack in one step

# Deploy just the GKE infrastructure
make create-gke-infra

Could you also add to the README how to choose, at the beginning, the GPUs to use or the model to deploy?

@falconlee236 (Contributor Author) commented Mar 10, 2025

@Hanchenli
I added a <🎮 GPU and Model Selection> section in README.md, and added GPU variables for custom settings in gke-infrastructure/variables.tf.
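For reference, GPU-selection variables of the kind described might look like the following in gke-infrastructure/variables.tf; the variable names and defaults here are illustrative, not the PR's actual definitions:

```hcl
# Illustrative only; the PR's actual variables.tf may differ.
variable "gpu_type" {
  description = "GPU accelerator type to attach to the GPU node pool"
  type        = string
  default     = "nvidia-l4"
}

variable "gpu_count" {
  description = "Number of GPUs per node"
  type        = number
  default     = 1
}

variable "gpu_machine_type" {
  description = "Machine type for GPU nodes"
  type        = string
  default     = "g2-standard-8"
}
```

Changing these lets users trade cost against capacity without editing the node pool resource itself.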

@Hanchenli (Collaborator)

Hi @falconlee236, we are waiting on the AWS Terraform tutorial to be finished. Do you want to talk about the blog some day? Are you in the LMCache or vLLM channel? If you send me your username in the channel, I will message you to chat about it.

@falconlee236 (Contributor Author) commented Mar 9, 2025

> Hi @falconlee236, we are waiting on the AWS terraform tutorial to be finished. Do you want to talk about the blog some day? Are you in LMCache or vLLM channel? Can you send me your user name in the channel and I will send you a message to chat about it.

Hi @Hanchenli
What's the difference between the vLLM and LMCache channels? I just sent a request to join the vLLM channel.

My username in the requested channel is Sangyun Lee. Thanks!

WIP: I will add the settings for GPUs and GCP machine type later today.

@falconlee236 falconlee236 changed the title [FEAT]: Terraform Quickstart Tutorials for Google GKE [FEAT] Terraform Quickstart Tutorials for Google GKE Mar 10, 2025
@YuhanLiu11 (Collaborator) left a comment

LGTM! Thanks a lot for this great contribution!

@YuhanLiu11 YuhanLiu11 merged commit 1cbf92f into vllm-project:main Mar 12, 2025
5 checks passed
Successfully merging this pull request may close these issues.

feature: Terraform Quickstart Tutorials for Google GKE
3 participants