This repo helps you provision a personal and private LLM inference endpoint on Google Cloud Run GPUs. The endpoint is OpenAI- and LangChain-compatible, supports authentication via API key, and can be used as a drop-in substitute for providers that support these standards.
Once deployed, it requires no infrastructure management and scales down to zero instances when not in use. This makes it suitable for developing projects where privacy is an important consideration.
This project extends Google's official guide by adding a proxy server that runs in the Cloud Run instance, handles auth, and forwards requests to a concurrently running Ollama instance. This means that you can in theory serve any model from Ollama's registry, though in practice Cloud Run resource caps (currently 32 GiB of memory) limit effective model size. See the model customization section below for more details.
Note
The initial setup for this project is the same as the official Cloud Run guide here.
If you don't already have a Google Cloud account, you will first need to sign up.
Navigate to the Google Cloud project selector and select or create a Google Cloud project. You will need to enable billing for the project, since GPUs are currently not part of Google Cloud's free tier.
Next, you must enable access to Artifact Registry, Cloud Build, Cloud Run, and Cloud Storage APIs for your project. Click here and select your newly created project, then follow the instructions to do so.
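If you would rather do this from the command line (assuming you already have the gcloud CLI set up; installation is covered a few steps below), the following command should enable the same APIs:
gcloud services enable \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  storage.googleapis.com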
GPUs are not part of the default project quota, so you will need to submit a quota increase request. From this page, select your project, then filter by Total Nvidia L4 GPU allocation without zonal redundancy, per project per region in the search bar. Find your desired region (Google currently recommends europe-west1; note that pricing may vary depending on region), then click the side menu and press Edit quota:
Enter a value (e.g. 5), and submit a request. Google claims that increase requests may take a few days to process, but you may receive an approval email almost immediately in practice.
Finally, you will need to set up proper IAM permissions for your project. Navigate to this page and select your project, then press Grant Access. In the resulting modal, paste the following permissions into the filter window and add them one by one to a principal on your project (a gcloud CLI alternative is sketched after the list):
roles/artifactregistry.admin
roles/cloudbuild.builds.editor
roles/run.admin
roles/resourcemanager.projectIamAdmin
roles/iam.serviceAccountUser
roles/serviceusage.serviceUsageConsumer
roles/storage.admin
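If you prefer to script this instead of clicking through the console, a loop like the following sketch should work once the gcloud CLI is set up. YOUR_PROJECT_NAME and YOUR_EMAIL are placeholders for your project ID and the principal you are granting access to:
# Grant each of the roles above to the chosen principal
for role in roles/artifactregistry.admin roles/cloudbuild.builds.editor roles/run.admin \
  roles/resourcemanager.projectIamAdmin roles/iam.serviceAccountUser \
  roles/serviceusage.serviceUsageConsumer roles/storage.admin; do
  gcloud projects add-iam-policy-binding YOUR_PROJECT_NAME \
    --member="user:YOUR_EMAIL" \
    --role="$role"
done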
At the end, your screen should look something like this:
Now, clone this repo if you haven't already and switch your working directory to be the cloned folder:
git clone https://github.com/jacoblee93/personallm.git
cd personallm
Rename the .env.example file to .env. Run something similar to the following command to randomly generate an API key:
openssl rand -base64 32
Paste this value into the API_KEYS field. You can provide multiple API keys by comma-separating them here, so make sure that none of your key values contain commas.
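For reference, the relevant line of your .env should then look something like this (the values below are placeholders, not real keys):
API_KEYS=first-generated-key,second-generated-key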
Install and initialize the gcloud CLI if you haven't already by following these instructions. If you already have the CLI installed, you may need to run gcloud components update to make sure you are on the latest CLI version.
Next, set your gcloud CLI project to be your project name:
gcloud config set project YOUR_PROJECT_NAME
And set the region to be the same one as where you requested GPU quota:
gcloud config set run/region YOUR_REGION
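If you want to double-check both settings, gcloud config list prints your active configuration, including the project and the run/region value you just set:
gcloud config list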
Finally, run the following command to deploy your new inference endpoint!
gcloud run deploy personallm \
--source . \
--concurrency 4 \
--cpu 8 \
--set-env-vars OLLAMA_NUM_PARALLEL=4 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--no-cpu-throttling \
--no-gpu-zonal-redundancy \
--timeout=600
When prompted with something like Allow unauthenticated invocations to [personallm] (y/N)?, you should respond with y. The internal proxy will handle authentication, and we want our endpoint to be reachable from anywhere for ease of use.
Note that deployments are quite slow since model weights are bundled directly into the container image; expect this step to take upwards of 20 minutes. Once it finishes, your terminal should print a Service URL, and that's it! You now have a personal, private LLM inference endpoint!
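Before reaching for an SDK, you can smoke test the endpoint with curl. The request below follows the same OpenAI chat completions format used by the SDK examples in the next section (note the /v1 suffix and the Bearer API key header); substitute your own service URL and generated key:
curl https://YOUR_SERVICE_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "qwen3:14b", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}'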
You can call your endpoint in a similar way to how you'd call an OpenAI model, only using your generated API key and your provisioned endpoint. Here are some examples:
uv add openai
from openai import OpenAI
# Note the /v1 suffix
client = OpenAI(
base_url="https://YOUR_SERVICE_URL/v1",
api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
model="qwen3:14b",
messages=[
{"role": "user", "content": "What is 2 + 2?"}
]
)
See OpenAI's SDK docs for examples of advanced features such as function/tool calling.
uv add langchain-ollama
from langchain_ollama import ChatOllama
model = ChatOllama(
model="qwen3:14b",
base_url="https://YOUR_SERVICE_URL",
client_kwargs={
"headers": {
"Authorization": "Bearer YOUR_API_KEY"
}
}
)
response = model.invoke("What is 2 + 2?")
See LangChain's docs for examples of advanced features such as function/tool calling.
npm install openai
import OpenAI from "openai";
// Note the /v1 suffix
const client = new OpenAI({
baseURL: "https://YOUR_SERVICE_URL/v1",
apiKey: "YOUR_API_KEY",
});
const result = await client.chat.completions.create({
model: "qwen3:14b",
messages: [{ role: "user", content: "What is 2 + 2?" }],
});
See OpenAI's SDK docs for examples of advanced features such as function/tool calling.
npm install @langchain/ollama @langchain/core
import { ChatOllama } from "@langchain/ollama";
const model = new ChatOllama({
model: "qwen3:14b",
baseUrl: "https://YOUR_SERVICE_URL",
headers: {
Authorization: "Bearer YOUR_API_KEY",
},
});
const result = await model.invoke("What is 2 + 2?");
See LangChain's docs for examples of advanced features such as function/tool calling.
Keep in mind that there will be additional cold start latency if the endpoint has not been used in some time.
The base configuration in this repo serves a 14 billion parameter model (Qwen 3) at roughly 20-25 output tokens per second. This model is quite capable and also supports function/tool calling, which makes it more useful when building agentic flows, but if speed becomes a concern you might try smaller models such as Google's Gemma 3. You can also run the popular DeepSeek-R1 if you do not need tool calling.
To customize the served model, open your Dockerfile and modify the ENV MODEL qwen3:14b line to be a different model from Ollama's registry:
# Store the model weights in the container image
# ENV MODEL gemma3:4b
# ENV MODEL deepseek-r1:14b
ENV MODEL qwen3:14b
Note that you will also have to change your client-side code to specify the new model as a parameter (e.g. gemma3:4b instead of qwen3:14b in the examples above).
If you have any questions or comments, please open an issue on this repo. You can also reach me @Hacubu on X (formerly Twitter).