This repo helps you provision a personal and private LLM inference endpoint on Google Cloud Run GPUs. The endpoint is OpenAI- and LangChain-compatible, supports authentication via API key, and can be used as a drop-in substitute for providers that support these standards.
Once deployed, it requires no infrastructure management and scales down to zero instances when not in use. This makes it suitable for developing projects where privacy is an important consideration.
This project extends Google's official guide by adding a proxy server that runs in the Cloud Run instance, handles auth, and forwards requests to a concurrently running Ollama instance. This means that you can in theory serve any model from Ollama's registry, though in practice Cloud Run resource caps (currently 32 GiB of memory) limit effective model size. See the model customization section below for more details.
Note
The initial setup for this project is the same as the official Cloud Run guide here.
If you don't already have a Google Cloud account, you will first need to sign up.
Navigate to the Google Cloud project selector and select or create a Google Cloud project. You will need to enable billing for the project, since GPUs are currently not part of Google Cloud's free tier.
Next, you must enable access to Artifact Registry, Cloud Build, Cloud Run, and Cloud Storage APIs for your project. Click here and select your newly created project, then follow the instructions to do so.
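If you would rather do this from the command line (assuming you already have the gcloud CLI set up; installation is covered a few steps below), the following command should enable the same APIs:
gcloud services enable \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  storage.googleapis.com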
GPUs are not part of the default project quota, so you will need to submit a quota increase request. From this page, select your project, then filter by Total Nvidia L4 GPU allocation without zonal redundancy, per project per region in the search bar. Find your desired region (Google currently recommends europe-west1; note that pricing may vary depending on region), then click the side menu and press Edit quota:
Enter a value (e.g. 5), and submit a request. Google claims that increase requests may take a few days to process, but you may receive an approval email almost immediately in practice.
Finally, you will need to set up proper IAM permissions for your project. Navigate to this page and select your project, then press Grant Access. In the resulting modal, paste the following permissions into the filter window and add them one by one to a principal on your project (a gcloud CLI alternative is sketched after the list):
roles/artifactregistry.admin
roles/cloudbuild.builds.editor
roles/run.admin
roles/resourcemanager.projectIamAdmin
roles/iam.serviceAccountUser
roles/serviceusage.serviceUsageConsumer
roles/storage.admin
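If you prefer to script this instead of clicking through the console, a loop like the following sketch should work once the gcloud CLI is set up. YOUR_PROJECT_NAME and YOUR_EMAIL are placeholders for your project ID and the principal you are granting access to:
# Grant each of the roles above to the chosen principal
for role in roles/artifactregistry.admin roles/cloudbuild.builds.editor roles/run.admin \
  roles/resourcemanager.projectIamAdmin roles/iam.serviceAccountUser \
  roles/serviceusage.serviceUsageConsumer roles/storage.admin; do
  gcloud projects add-iam-policy-binding YOUR_PROJECT_NAME \
    --member="user:YOUR_EMAIL" \
    --role="$role"
done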
At the end, your screen should look something like this:
Now, clone this repo if you haven't already and switch your working directory to be the cloned folder:
git clone https://github.com/jacoblee93/personallm.git
cd personallm
Rename the .env.example file to .env. Run something similar to the following command to randomly generate an API key:
openssl rand -base64 32
Paste this value into the API_KEYS field. You can provide multiple API keys by comma-separating them here, so make sure that none of your key values contain commas.
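For reference, the relevant line of your .env should then look something like this (the values below are placeholders, not real keys):
API_KEYS=first-generated-key,second-generated-key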
Install and initialize the gcloud CLI if you haven't already by following these instructions. If you already have the CLI installed, you may need to run gcloud components update to make sure you are on the latest CLI version.
Next, set your gcloud CLI project to be your project name:
gcloud config set project YOUR_PROJECT_NAME
And set the region to be the same one as where you requested GPU quota:
gcloud config set run/region YOUR_REGION
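If you want to double-check both settings, gcloud config list prints your active configuration, including the project and the run/region value you just set:
gcloud config list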
Finally, run the following command to deploy your new inference endpoint!
gcloud run deploy personallm \
--source . \
--concurrency 4 \
--cpu 8 \
--set-env-vars OLLAMA_NUM_PARALLEL=4 \
--gpu 1 \
--gpu-type nvidia-l4 \
--max-instances 1 \
--memory 32Gi \
--no-cpu-throttling \
--no-gpu-zonal-redundancy \
--timeout=600
When prompted with something like Allow unauthenticated invocations to [personallm] (y/N)?, you should respond with y. The internal proxy will handle authentication, and we want our endpoint to be reachable from anywhere for ease of use.
Note that deployments are quite slow since model weights are bundled directly into the container image; expect this step to take upwards of 20 minutes. Once it finishes, your terminal should print a Service URL, and that's it! You now have a personal, private LLM inference endpoint!
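Before reaching for an SDK, you can smoke test the endpoint with curl. The request below follows the same OpenAI chat completions format used by the SDK examples in the next section (note the /v1 suffix and the Bearer API key header); substitute your own service URL and generated key:
curl https://YOUR_SERVICE_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "qwen3:14b", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}'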
You can call your endpoint in a similar way to how you'd call an OpenAI model, only using your generated API key and your provisioned endpoint. Here are some examples:
uv add openai
from openai import OpenAI
# Note the /v1 suffix
client = OpenAI(
base_url="https://YOUR_SERVICE_URL/v1",
api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
model="qwen3:14b",
messages=[
{"role": "user", "content": "What is 2 + 2?"}
]
)
See OpenAI's SDK docs for examples of advanced features such as function/tool calling.
uv add langchain-ollama
from langchain_ollama import ChatOllama
model = ChatOllama(
model="qwen3:14b",
base_url="https://YOUR_SERVICE_URL",
client_kwargs={
"headers": {
"Authorization": "Bearer YOUR_API_KEY"
}
}
)
response = model.invoke("What is 2 + 2?")
See LangChain's docs for examples of advanced features such as function/tool calling.
npm install openai
import OpenAI from "openai";
// Note the /v1 suffix
const client = new OpenAI({
baseURL: "https://YOUR_SERVICE_URL/v1",
apiKey: "YOUR_API_KEY",
});
const result = await client.chat.completions.create({
model: "qwen3:14b",
messages: [{ role: "user", content: "What is 2 + 2?" }],
});
See OpenAI's SDK docs for examples of advanced features such as function/tool calling.
npm install @langchain/ollama @langchain/core
import { ChatOllama } from "@langchain/ollama";
const model = new ChatOllama({
model: "qwen3:14b",
baseUrl: "https://YOUR_SERVICE_URL",
headers: {
Authorization: "Bearer YOUR_API_KEY",
},
});
const result = await model.invoke("What is 2 + 2?");
See LangChain's docs for examples of advanced features such as function/tool calling.
Keep in mind that there will be additional cold start latency if the endpoint has not been used in some time.
The base configuration in this repo serves a 14 billion parameter model (Qwen 3) at roughly 20-25 output tokens per second. This model is quite capable and also supports function/tool calling, which makes it more useful when building agentic flows, but if speed becomes a concern you might try smaller models such as Google's Gemma 3. You can also run the popular DeepSeek-R1 if you do not need tool calling.
To customize the served model, open your Dockerfile and modify the ENV MODEL qwen3:14b line to be a different model from Ollama's registry:
# Store the model weights in the container image
# ENV MODEL gemma3:4b
# ENV MODEL deepseek-r1:14b
ENV MODEL qwen3:14b
Note that you will also have to change your client-side code to specify the new model as a parameter (e.g. gemma3:4b instead of qwen3:14b in the examples above).
If you have any questions or comments, please open an issue on this repo. You can also reach me @Hacubu on X (formerly Twitter).