[feat] create vector embeddings #22
Conversation
great work, carolyn! I think this is a great first proof of concept, but we do need to ensure that this code is scalable for when we handle the almost 4k rows of adoptee data next week. let me know if you have any questions about my requested changes.
actions/embeddings/embed.py
"""Store the vector information in the adoptee_vector table.""" | ||
|
||
adoptee = supabase_client.table("adoptee").select("*").execute().data | ||
adoptee_vector = vx.get_or_create_collection("adoptee_vector", dimension=384) |
the 384 here is the dimension of the `paraphrase-MiniLM-L3-v2` model embeddings, right? Instead of hardcoding this value, could you create a mapping of model name to dimension size that doesn't require us to hardcode this parameter? This will help us test other models very easily in the future.
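something like this could work (a sketch; the entries besides `paraphrase-MiniLM-L3-v2` are just examples of models we might try):

```python
# map sentence-transformers model names to their embedding dimensions,
# so the dimension never has to be hardcoded at the call site
MODEL_DIMENSIONS = {
    "paraphrase-MiniLM-L3-v2": 384,
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}

MODEL_NAME = "paraphrase-MiniLM-L3-v2"

# vx is the vecs client already created in embed.py
adoptee_vector = vx.get_or_create_collection(
    "adoptee_vector", dimension=MODEL_DIMENSIONS[MODEL_NAME]
)
```

alternatively, sentence-transformers can report the dimension at runtime via `model.get_sentence_embedding_dimension()`, which would avoid maintaining the mapping by hand.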
actions/embeddings/embed.py
```python
records = []

for row in adoptee:
```
can you wrap this with a tqdm? like this: https://tqdm.github.io/
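e.g. (a sketch against the existing loop):

```python
from tqdm import tqdm

# wrapping the iterable in tqdm() prints a live progress bar as rows are processed
for row in tqdm(adoptee, desc="embedding adoptee bios"):
    ...
```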
actions/embeddings/embed.py
```python
    records.append((row_id, embedding, metadata))

try:
    adoptee_vector.upsert(records)
```
this approach of upserting after we generate embeddings for the entire table works for small tables, but when we scale to the entire dataset (almost 4k rows), it's likely inefficient to try to upsert all that data at once. I think this is especially important for us, since the embedding size can be big, and supabase might have some size limitations.
instead, can you add batching functionality? basically, once the `records` list has length `x` (you can set this to your best guess for now), we will upsert that list, reset `records` to an empty list, and then keep getting more data. this way we will only upsert at most `x` records at a time.
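something like this (a sketch against the existing loop; `BATCH_SIZE` is just a placeholder guess):

```python
BATCH_SIZE = 500  # placeholder; tune once we see real embedding/metadata sizes

records = []
for row in adoptee:
    # ... build row_id, embedding, and metadata exactly as before ...
    records.append((row_id, embedding, metadata))

    # once a full batch has accumulated, upsert it and start a fresh list,
    # so we never send more than BATCH_SIZE records in a single request
    if len(records) >= BATCH_SIZE:
        adoptee_vector.upsert(records)
        records = []

# upsert the final partial batch
if records:
    adoptee_vector.upsert(records)
```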
.gitignore

```
# python
/adopt-an-inmate-venv
```
I would also recommend adding a line for `__pycache__`. It's generally best not to add binaries to the code repo, especially if they are generated by a package manager like `pip`, since they can become a magnet for merge conflicts.
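e.g.:

```
# python
/adopt-an-inmate-venv
__pycache__/
```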
Should we create a separate folder for Python code? `/actions` is generally used for server actions, which is a function of the application server. In this case, this code seems more appropriate for testing or research. Perhaps `/research`? Or if it is intended to be deployed and executed by an edge function, perhaps something like `/edge` or `/nlp`.
What's new in this PR

Description

created a python script in `actions/embeddings` to:

- query the `public.adoptee` table in supabase
- generate embeddings for the `bio` column
- upsert into `vecs.adoptee_vector` with the embedding and other adoptee data (`id`, `gender`, `age`, `state`, `offense`, `veteran_status`)

created a sql query to copy the data from `vecs.adoptee_vector` into `public.adoptee_vector` (see the sketch below). the columns other than `id` and `vec` are in json format, so the query extracts the information and places it into the corresponding column.

runtime: the script runs pretty fast, maybe <20 seconds for setting up the clients, generating embeddings, and upserting the data for 10 bios
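for reference, the extraction works roughly like this (a simplified sketch, not the exact `transfer_data` query; the column list and casts are illustrative):

```sql
-- vecs stores the non-vector fields in a jsonb metadata column,
-- so each field is pulled out with the ->> operator
insert into public.adoptee_vector (id, vec, gender, age, state, offense, veteran_status)
select
    v.id,
    v.vec,
    v.metadata ->> 'gender',
    (v.metadata ->> 'age')::int,
    v.metadata ->> 'state',
    v.metadata ->> 'offense',
    v.metadata ->> 'veteran_status'
from vecs.adoptee_vector v;
```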
Screenshots

`public.adoptee_vector` table:

How to review

- set `DATABASE_URL` (in the passwords google doc)
- create the virtual environment `adopt-an-inmate-venv` and activate it
- run `actions/embeddings/embed.py`

there won't be any changes because all of the embeddings have been created and added to `vecs.adoptee_vector`. feel free to add additional data or edit existing data to check that the changes are reflected.

to transfer the data to the `public` schema, run the sql query `transfer_data`
Next steps

`adoptee_vector` data is stored to a different schema than the `adoptee` table. right now, i've created a sql query that transfers the data across schemas, but this will need to be revisited later. i wouldn't suggest keeping the data in the `vecs` schema, as a good portion of the data is stored in json format due to restrictions with the vecs library. we could also consider keeping only the `id` and `vec` columns, as the other data can be retrieved via `id` from `public.adoptee_vector`.

Relevant links

Online sources

- vecs documentation: https://supabase.github.io/vecs/
- https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb
Related PRs
CC: @ethan-tam33