[feat] create vector embeddings #22
Conversation
great work, carolyn! I think this is a great first proof of concept, but we do need to ensure that this code is scalable for when we handle the almost 4k rows of adoptee data next week. let me know if you have any questions about my requested changes.
actions/embeddings/embed.py
"""Store the vector information in the adoptee_vector table.""" | ||
|
||
adoptee = supabase_client.table("adoptee").select("*").execute().data | ||
adoptee_vector = vx.get_or_create_collection("adoptee_vector", dimension=384) |
the 384 here is the dimension of the `paraphrase-MiniLM-L3-v2` model embeddings, right? Instead of hardcoding this value, could you create a mapping of model name to dimension size that doesn't require us to hardcode this parameter? This will help us test other models very easily in the future.
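something like this could work (a sketch; the entries besides `paraphrase-MiniLM-L3-v2` are just examples of models we might try):

```python
# map sentence-transformers model names to their embedding dimensions,
# so the dimension never has to be hardcoded at the call site
MODEL_DIMENSIONS = {
    "paraphrase-MiniLM-L3-v2": 384,
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}

MODEL_NAME = "paraphrase-MiniLM-L3-v2"

# vx is the vecs client already created in embed.py
adoptee_vector = vx.get_or_create_collection(
    "adoptee_vector", dimension=MODEL_DIMENSIONS[MODEL_NAME]
)
```

alternatively, sentence-transformers can report the dimension at runtime via `model.get_sentence_embedding_dimension()`, which would avoid maintaining the mapping by hand.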
actions/embeddings/embed.py
```python
records = []

for row in adoptee:
```
can you wrap this with a tqdm? like this: https://tqdm.github.io/
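e.g. (a sketch against the existing loop):

```python
from tqdm import tqdm

# wrapping the iterable in tqdm() prints a live progress bar as rows are processed
for row in tqdm(adoptee, desc="embedding adoptee bios"):
    ...
```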
actions/embeddings/embed.py
```python
    records.append((row_id, embedding, metadata))

try:
    adoptee_vector.upsert(records)
```
this approach of upserting after we generate embeddings for the entire table works for small tables, but when we scale to the entire dataset (almost 4k rows), it's likely inefficient to try to upsert all that data at once. I think this is especially important for us, since the embedding size can be big, and supabase might have some size limitations.
instead, can you add batching functionality? basically, once the `records` list has length `x` (you can set this to your best guess for now), we will upsert that list, reset `records` to an empty list, and then keep getting more data. this way we will only upsert at most `x` records at a time.
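something like this (a sketch against the existing loop; `BATCH_SIZE` is just a placeholder guess):

```python
BATCH_SIZE = 500  # placeholder; tune once we see real embedding/metadata sizes

records = []
for row in adoptee:
    # ... build row_id, embedding, and metadata exactly as before ...
    records.append((row_id, embedding, metadata))

    # once a full batch has accumulated, upsert it and start a fresh list,
    # so we never send more than BATCH_SIZE records in a single request
    if len(records) >= BATCH_SIZE:
        adoptee_vector.upsert(records)
        records = []

# upsert the final partial batch
if records:
    adoptee_vector.upsert(records)
```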
.gitignore

```
# python
/adopt-an-inmate-venv
```
I would also recommend adding a line for `__pycache__`. It's generally best not to add binaries to the code repo, especially if they are generated by a package manager like `pip`, since they can become a magnet for merge conflicts.
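e.g.:

```
# python
/adopt-an-inmate-venv
__pycache__/
```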
Should we create a separate folder for Python code? `/actions` is generally used for server actions, which is a function of the application server. In this case, this code seems more appropriate for testing or research. Perhaps `/research`? Or if it is intended to be deployed and executed by an edge function, perhaps something like `/edge` or `/nlp`.
What's new in this PR

Description

created a python script in `actions/embeddings` to:

- query the `public.adoptee` table in supabase
- generate embeddings for the `bio` column
- upsert into `vecs.adoptee_vector` with the embedding and other adoptee data (`id`, `gender`, `age`, `state`, `offense`, `veteran_status`)

created a sql query to copy the data from `vecs.adoptee_vector` into `public.adoptee_vector` (see the sketch below). the columns other than `id` and `vec` are in json format, so the query extracts the information and places it into the corresponding column.

runtime: the script runs pretty fast, maybe <20 seconds for setting up the clients, generating embeddings, and upserting the data for 10 bios
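for reference, the extraction works roughly like this (a simplified sketch, not the exact `transfer_data` query; the column list and casts are illustrative):

```sql
-- vecs stores the non-vector fields in a jsonb metadata column,
-- so each field is pulled out with the ->> operator
insert into public.adoptee_vector (id, vec, gender, age, state, offense, veteran_status)
select
    v.id,
    v.vec,
    v.metadata ->> 'gender',
    (v.metadata ->> 'age')::int,
    v.metadata ->> 'state',
    v.metadata ->> 'offense',
    v.metadata ->> 'veteran_status'
from vecs.adoptee_vector v;
```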
Screenshots

`public.adoptee_vector` table:

How to review

- set `DATABASE_URL` (in the passwords google doc)
- create the virtual environment `adopt-an-inmate-venv` and activate it
- run `actions/embeddings/embed.py`

there won't be any changes because all of the embeddings have been created and added to `vecs.adoptee_vector`. feel free to add additional data or edit existing data to check that the changes are reflected.

to transfer the data to the `public` schema, run the sql query `transfer_data`
Next steps

`adoptee_vector` data is stored to a different schema than the `adoptee` table. right now, i've created a sql query that transfers the data across schemas, but this will need to be revisited later. i wouldn't suggest keeping the data in the `vecs` schema, as a good portion of the data is stored in json format due to restrictions with the vecs library. we could also consider keeping only the `id` and `vec` columns, as the other data can be retrieved via `id` from `public.adoptee_vector`.

Relevant links

Online sources

- vecs documentation: https://supabase.github.io/vecs/
- https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/semantic_text_deduplication.ipynb
Related PRs
CC: @ethan-tam33