@papa99do
Collaborator

@papa99do papa99do commented Aug 18, 2025

Change Summary

Queries for base64 image search come from a fixed set of possible patches, so they should be very cache friendly. This PR adds a cache for the embeddings of base64-encoded images. We use the BLAKE3 hash function to hash the encoded image, and the resulting digest is the key of the cached item. Collisions are unlikely. We trade off memory space against some extra CPU time.
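For reviewers, here is a minimal sketch of the idea, not the code in this PR. It assumes the `blake3` package from PyPI is installed and falls back to `hashlib.blake2b` otherwise. `EmbeddingCache`, `embed_fn`, and `max_size` are hypothetical names used only for illustration, and the LRU eviction policy is an assumption.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List

try:
    from blake3 import blake3  # third-party package: pip install blake3

    def _digest(data: bytes) -> str:
        return blake3(data).hexdigest()
except ImportError:
    # Fallback so the sketch still runs without blake3 installed.
    def _digest(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=32).hexdigest()


class EmbeddingCache:
    """Hypothetical LRU cache keyed by a hash of the base64 string."""

    def __init__(self, embed_fn: Callable[[str], List[float]], max_size: int = 1024):
        self._embed_fn = embed_fn
        self._max_size = max_size
        self._items: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image_b64: str) -> List[float]:
        # Key on the fixed-size digest instead of the full base64 string.
        key = _digest(image_b64.encode("utf-8"))
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        embedding = self._embed_fn(image_b64)
        self._items[key] = embedding
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used
        return embedding
```

Each lookup hashes the encoded image once. A repeated patch then returns the stored embedding without calling `embed_fn` again.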

Related Jira Ticket

https://s2search.atlassian.net/browse/MOSD-443

Checklist

  • Tests have been added for changes
  • [N/A] Documentation has been updated
  • [N/A] Breaking changes are clearly identified
  • [N/A] Python client changes linked or N/A

For new field types:

  • Tests cover score modifier usage of this new type
  • Test indexes updated to cover the new type for all APIs (add docs, search, partial update, etc.)


@papa99do papa99do mentioned this pull request Aug 19, 2025
2 tasks
@papa99do papa99do requested a review from wanliAlex August 19, 2025 08:24
wanliAlex
wanliAlex previously approved these changes Aug 21, 2025
@papa99do
Collaborator Author

Perf test results for BLAKE3 hashing, compared with MD5, SHA-256, and BLAKE2b.

Results from a g4dn.xlarge host:

algo     mode       size              bytes    iters  time(s)   GB/s
--------------------------------------------------------------------
blake3   bytes      10 KB     4,511,416,320  440,568     2.00   2.26
blake3   str→bytes  10 KB     3,897,876,480  380,652     2.00   1.95
md5      bytes      10 KB     1,087,610,880  106,212     2.00   0.54
md5      str→bytes  10 KB     1,171,251,200  114,380     2.00   0.59
sha256   bytes      10 KB       783,933,440   76,556     2.00   0.39
sha256   str→bytes  10 KB       772,372,480   75,427     2.00   0.39
blake2b  bytes      10 KB     1,151,518,720  112,453     2.00   0.58
blake2b  str→bytes  10 KB     1,128,253,440  110,181     2.00   0.56
blake3   bytes      100 KB   10,055,680,000   98,200     2.00   5.03
blake3   str→bytes  100 KB    8,751,104,000   85,460     2.00   4.38
md5      bytes      100 KB    1,267,609,600   12,379     2.00   0.63
md5      str→bytes  100 KB    1,242,419,200   12,133     2.00   0.62
sha256   bytes      100 KB      815,104,000    7,960     2.00   0.41
sha256   str→bytes  100 KB      804,556,800    7,857     2.00   0.40
blake2b  bytes      100 KB    1,154,150,400   11,271     2.00   0.58
blake2b  str→bytes  100 KB    1,157,632,000   11,305     2.00   0.58
blake3   bytes      1 MB     11,110,711,296   10,596     2.00   5.56
blake3   str→bytes  1 MB      7,893,680,128    7,528     2.00   3.95
md5      bytes      1 MB      1,272,971,264    1,214     2.00   0.64
md5      str→bytes  1 MB      1,218,445,312    1,162     2.00   0.61
sha256   bytes      1 MB        817,889,280      780     2.00   0.41
sha256   str→bytes  1 MB        794,820,608      758     2.00   0.40
blake2b  bytes      1 MB      1,168,113,664    1,114     2.00   0.58
blake2b  str→bytes  1 MB      1,131,413,504    1,079     2.00   0.57
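
The script that produced these numbers is not included here. A rough sketch of how a comparable measurement could be taken is below. It assumes each run hashes a fixed payload repeatedly for about 2 seconds, that the "str→bytes" mode includes the cost of encoding the string before hashing, and that GB/s means bytes / time / 1e9 (which matches the rows above). `bench` and `ALGOS` are hypothetical names. `blake3` is included only if the PyPI package is installed.

```python
import hashlib
import time
from typing import Callable, Dict, Union

# Hash functions under test; each takes bytes and returns a digest.
ALGOS: Dict[str, Callable[[bytes], bytes]] = {
    "md5": lambda d: hashlib.md5(d).digest(),
    "sha256": lambda d: hashlib.sha256(d).digest(),
    "blake2b": lambda d: hashlib.blake2b(d).digest(),
}
try:
    from blake3 import blake3  # third-party package: pip install blake3
    ALGOS["blake3"] = lambda d: blake3(d).digest()
except ImportError:
    pass  # blake3 not installed; benchmark the stdlib algorithms only


def bench(hash_fn: Callable[[bytes], bytes],
          payload: Union[bytes, str],
          duration_s: float = 2.0) -> Dict[str, float]:
    """Hash `payload` repeatedly for roughly `duration_s` seconds."""
    is_str = isinstance(payload, str)
    # For a str payload, count the encoded size (ASCII assumed, as for base64).
    size = len(payload.encode("ascii")) if is_str else len(payload)
    iters = 0
    start = time.perf_counter()
    while True:
        # In str mode the encode step is part of the measured work.
        data = payload.encode("ascii") if is_str else payload
        hash_fn(data)
        iters += 1
        elapsed = time.perf_counter() - start
        if elapsed >= duration_s:
            break
    total_bytes = iters * size
    return {
        "iters": iters,
        "bytes": total_bytes,
        "time_s": elapsed,
        "gb_per_s": total_bytes / elapsed / 1e9,
    }
```

For example, `bench(ALGOS["sha256"], b"x" * 10240)` corresponds to the "sha256 / bytes / 10 KB" row, and passing `"x" * 10240` instead exercises the str→bytes mode. The timer is read on every iteration, which adds a small overhead for the smallest payloads.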

Collaborator

@wanliAlex wanliAlex left a comment

LGTM

@papa99do papa99do merged commit daa7aec into mainline Sep 1, 2025
48 of 50 checks passed
@papa99do papa99do deleted the yihan/base64-cache branch September 1, 2025 05:35
4 participants