
Conversation

@carolynzhuang (Contributor) commented Oct 10, 2025

What's new in this PR

Description

created a python script in actions/embeddings (sketched after the list) to:

  1. pull adoptee data from the public.adoptee table in supabase
  2. generate embeddings using the bio column
  3. populate the vecs.adoptee_vector with the embedding and other adoptee data (id, gender, age, state, offense, veteran_status)
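
a condensed sketch of that flow for reviewers (the env var names and client setup here are assumptions; the real logic lives in actions/embeddings/embed.py):

```python
import os

import vecs
from sentence_transformers import SentenceTransformer
from supabase import create_client

# assumed env vars; DATABASE_URL is in the passwords google doc
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
vx = vecs.create_client(os.environ["DATABASE_URL"])

model = SentenceTransformer("paraphrase-MiniLM-L3-v2")  # 384-dim embeddings
adoptee_vector = vx.get_or_create_collection("adoptee_vector", dimension=384)

# 1. pull adoptee rows, 2. embed each bio, 3. upsert (id, vector, metadata)
rows = supabase.table("adoptee").select("*").execute().data
records = []
for row in rows:
    embedding = model.encode(row["bio"]).tolist()
    metadata = {k: row[k] for k in ("gender", "age", "state", "offense", "veteran_status")}
    records.append((row["id"], embedding, metadata))

adoptee_vector.upsert(records)
```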

created a sql query (transfer_data) to copy the data from vecs.adoptee_vector into public.adoptee_vector. the columns other than id and vec are stored as json metadata, so the query extracts each field and places it into the corresponding column (sketched below).
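
a sketch of that extraction, assuming vecs' default (id, vec, metadata jsonb) table layout; the target column types here are guesses, and the actual query lives in transfer_data:

```sql
insert into public.adoptee_vector (id, vec, gender, age, state, offense, veteran_status)
select
    id,
    vec,
    metadata->>'gender',
    (metadata->>'age')::int,
    metadata->>'state',
    metadata->>'offense',
    (metadata->>'veteran_status')::boolean
from vecs.adoptee_vector
on conflict (id) do update set vec = excluded.vec;
```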

runtime: the script runs pretty fast, under ~20 seconds to set up the clients, generate the embeddings, and upsert the data for 10 bios

Screenshots

public.adoptee_vector table:
[screenshot]

How to review

  1. update .env to include DATABASE_URL (in the passwords google doc)
  2. create a python virtual environment named adopt-an-inmate-venv and activate it.
  3. pip install -r requirements.txt
  4. run python actions/embeddings/embed.py

there won't be any changes on a fresh run because all of the embeddings have already been created and added to vecs.adoptee_vector. feel free to add new data or edit existing data to check that the changes are reflected.

to transfer the data to the public schema, run the sql query transfer_data

Next steps

  • create an edge function that triggers the python script via http request
  • resolve schemas: since we are using the vecs python library, the adoptee_vector data is stored in a different schema than the adoptee table. right now, i've created a sql query that transfers the data across schemas, but this will need to be revisited later. i wouldn't suggest keeping the data in the vecs schema, as a good portion of it is stored in json format due to restrictions with the vecs library. we could also consider keeping only the id and vec columns, as the other data can be retrieved via id from public.adoptee_vector.

Relevant links

Online sources

vecs documentation: https://supabase.github.io/vecs/
supabase semantic text deduplication example notebook: https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb

Related PRs

CC: @ethan-tam33

@carolynzhuang linked an issue Oct 10, 2025 that may be closed by this pull request
@ethan-tam33 (Collaborator) left a comment:


great work carolyn! I think this is a solid first proof of concept, but we do need to ensure that this code is scalable for when we handle the almost 4k rows of adoptee data next week. let me know if you have any questions about my requested changes

"""Store the vector information in the adoptee_vector table."""

adoptee = supabase_client.table("adoptee").select("*").execute().data
adoptee_vector = vx.get_or_create_collection("adoptee_vector", dimension=384)
@ethan-tam33 (Collaborator) commented:

the 384 here is the dimension of the paraphrase-MiniLM-L3-v2 model embeddings right? Instead of hardcoding this value, could you create a mapping of model name to dimension size that doesn't require us to hardcode this parameter? This will help us test other models very easily in the future.
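
e.g. a sketch (MODEL_DIMENSIONS and MODEL_NAME are hypothetical names; dimensions per the sentence-transformers model cards):

```python
# hypothetical mapping of model name -> embedding dimension
MODEL_DIMENSIONS = {
    "paraphrase-MiniLM-L3-v2": 384,
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}

MODEL_NAME = "paraphrase-MiniLM-L3-v2"
adoptee_vector = vx.get_or_create_collection(
    "adoptee_vector", dimension=MODEL_DIMENSIONS[MODEL_NAME]
)
```

alternatively, sentence-transformers can report the size directly via model.get_sentence_embedding_dimension(), which would avoid maintaining the mapping at all.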


```python
records = []

for row in adoptee:
```
@ethan-tam33 (Collaborator) commented:

can you wrap this with a tqdm? like this: https://tqdm.github.io/
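
i.e. something like (assuming `adoptee` is the list of rows):

```python
from tqdm import tqdm

for row in tqdm(adoptee, desc="embedding bios"):
    ...
```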

```python
    records.append((row_id, embedding, metadata))

try:
    adoptee_vector.upsert(records)
```
@ethan-tam33 (Collaborator) commented:

this approach of upserting after we generate embeddings for the entire table works for small tables, but when we scale to the entire dataset (almost 4k rows), it's likely inefficient to try to upsert all that data at once. I think this is especially important for us, since the embedding size can be big, and supabase might have some size limitations.

instead, can you add batching functionality? basically, once the records list reaches length x (you can set this to your best guess for now), we upsert that list, reset records to an empty list, and keep going. this way we only ever upsert at most x records at a time. rough sketch below:
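
a minimal sketch of the batching loop, reusing the `model`, `adoptee`, and `adoptee_vector` names from the current script (BATCH_SIZE of 500 is just a placeholder, and the metadata fields are assumed from the PR description):

```python
from tqdm import tqdm

BATCH_SIZE = 500  # placeholder; tune once we see real payload sizes

records = []
for row in tqdm(adoptee, desc="embedding bios"):
    embedding = model.encode(row["bio"]).tolist()
    metadata = {k: row[k] for k in ("gender", "age", "state", "offense", "veteran_status")}
    records.append((row["id"], embedding, metadata))
    if len(records) >= BATCH_SIZE:
        adoptee_vector.upsert(records)  # flush a full batch
        records = []

if records:
    adoptee_vector.upsert(records)  # flush the final partial batch
```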

Comment on lines +38 to +39 (.gitignore):

```
# python
/adopt-an-inmate-venv
```
A contributor commented:

I would also recommend adding a line for __pycache__.

It's generally best not to add binaries to the code repo, especially if they are generated by a package manager like pip, since they can become a magnet for merge conflicts.
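
a sketch of what the block could end up looking like:

```
# python
/adopt-an-inmate-venv
__pycache__/
```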

@jinkang-0 (Contributor) commented Oct 10, 2025:

Should we create a separate folder for Python code?

/actions is generally used for server actions, which is a function of the application server.

In this case, this code seems more appropriate for testing or research. Perhaps /research? Or if it is intended to be deployed and executed by an edge function, perhaps something like /edge or /nlp.



Development

Successfully merging this pull request may close these issues.

Create Vector Embeddings

3 participants