
Conversation


@liujch1998 liujch1998 commented Apr 30, 2025

Import functionalities such as find, count, and prob, so that this API supports everything in infini-gram.
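For context, a toy sketch of what two of these queries compute (an illustrative pure-Python model, not infini-gram's suffix-array implementation): count returns the number of occurrences of an n-gram in the corpus, and prob estimates P(last token | preceding tokens) from those counts.

```python
# Illustrative only: infini-gram computes these over a suffix array of a
# tokenized corpus; this toy version scans a small token list instead.
def count(corpus, ngram):
    """Number of occurrences of `ngram` (a tuple of tokens) in `corpus`."""
    n = len(ngram)
    return sum(1 for i in range(len(corpus) - n + 1)
               if tuple(corpus[i:i + n]) == ngram)

def prob(corpus, ngram):
    """P(ngram[-1] | ngram[:-1]) = count(ngram) / count(ngram[:-1])."""
    denom = count(corpus, ngram[:-1])
    return count(corpus, ngram) / denom if denom else 0.0

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
```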

TODOs:

  • Set max and default values of query parameters
  • Support string query: tokenize and detokenize
  • Check rate limit
  • Check with legal on document endpoints

Testing

$ python scripts/test_infini_gram_api.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.312s

OK

Calling attribution() on "1, 2, 3, 4, ..."

Before:

inputTokens: ['', '1', ',', ' ', '2', ',', ' ', '3', ',', ' ', ...]
spans:
l=0, r=86, count=1, text=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
l=32, r=90, count=1, text=11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
...

After:

inputTokens: ['1', ',', ' ', '2', ',', ' ', '3', ',', ' ', ...]
spans:
l=0, r=85, count=1, text=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
l=31, r=89, count=1, text=11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
...

Note that the empty token is gone at the beginning. As a result, span boundaries are shifted by one.
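The change can be sketched as follows (a simplified stand-in for the actual fix; 29871 is the Llama-2 SentencePiece leading-whitespace token discussed below, and the (l, r) half-open span representation is assumed):

```python
# Simplified sketch: drop Llama-2's spurious leading whitespace token
# (id 29871) and shift span boundaries left by one to match the new
# token indices. The token id is specific to the Llama-2 tokenizer.
LLAMA2_LEADING_WS = 29871

def strip_leading_ws(token_ids, spans):
    """`spans` is a list of (l, r) half-open token ranges into `token_ids`."""
    if token_ids and token_ids[0] == LLAMA2_LEADING_WS:
        token_ids = token_ids[1:]
        spans = [(max(l - 1, 0), r - 1) for (l, r) in spans]
    return token_ids, spans
```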

@liujch1998 liujch1998 requested a review from mtblanton April 30, 2025 17:27

  documents_by_span = await asyncio.to_thread(
-     infini_gram_index.get_documents_by_pointers,
+     infini_gram_index.get_documents_by_pointers_grouped,
Contributor Author (liujch1998):

I renamed this function to make way for a regular version of get_documents_by_pointers.

Comment on lines +75 to +82
self.MAX_QUERY_CHARS = 1000
self.MAX_QUERY_TOKENS = 500
self.MAX_CLAUSES_PER_CNF = 4
self.MAX_TERMS_PER_CLAUSE = 4
self.MAX_SUPPORT = 10000
self.MAX_CLAUSE_FREQ = 500000
self.MAX_DIFF_TOKENS = 1000

Contributor Author (liujch1998):

Probably a stupid idea. Ideally these should be configurable without a code change.

)

@tracer.start_as_current_span("infini_gram_processor/get_documents_by_pointers")
def get_documents_by_pointers(
Contributor Author (liujch1998):

This is now the regular version of get_documents_by_pointers.

Comment on lines 49 to 51
# This is to fix a corner case: when input begins with a number, the token ids will begin with [29871 (whitespace), 29896, ...]
if len(encoded_query) > 0 and encoded_query[0] == 29871:
encoded_query = encoded_query[1:]
Contributor Author (liujch1998), May 1, 2025:

I don't think this will break anything. So far the tokenizer is only used by the attribution function to tokenize the model response. Fixing the leading whitespace is actually desired there, too.

Will test OLMoTrace with this.

Update: test added

pyproject.toml Outdated
  { path = "vendor/infini_gram-2.5.1-cp312-cp312-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.12' and platform_machine == 'x86_64'" },
  { path = "vendor/infini_gram-2.5.1-cp313-cp313-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.13' and platform_machine == 'x86_64'" },
- { path = "vendor/infini_gram-2.5.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
+ { path = "vendor/infini_gram-2.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
Contributor Author (liujch1998):

The only change there is that I renamed get_docs_by_ptrs_2() to get_docs_by_ptrs_2_grouped() in EngineDiff

Contributor:

Have you published this version yet? We can stop fiddling with these if so.

Contributor Author (liujch1998):

I'm publishing it to pypi now. After that, what do we need to change here? I still want to pin to a specific version of infini-gram here.

Contributor:

We can get rid of the config to pull from a file and reference infini-gram like other packages in the processor's pyproject.toml!

dependencies = [
    "opentelemetry-api==1.30.0",
    "opentelemetry-sdk==1.30.0",
    "infini-gram==2.5.1",
    "transformers==4.49.0",
]

Contributor Author (liujch1998):

Nice!! I got rid of those file references and pinned to infini-gram==2.5.2.

@liujch1998 liujch1998 marked this pull request as ready for review May 1, 2025 08:18
@liujch1998 liujch1998 changed the title Import other infini-gram functionalities here Support other infini-gram functionalities here May 2, 2025
)


@infinigram_router.post(path="/{index}/ntd")
Contributor (mtblanton), May 6, 2025:

It'd be good to explain what these endpoints do, especially for things like ntd that don't say what they are.

Suggested change:
- @infinigram_router.post(path="/{index}/ntd")
+ @infinigram_router.post(path="/{index}/ntd", description="<Describe what this ntd endpoint does>")

https://fastapi.tiangolo.com/tutorial/path-operation-configuration/#summary-and-description

Contributor Author (liujch1998):

Added now!

if len(offset_mapping) > 1:
    if offset_mapping[0][1] > offset_mapping[1][0]:
        offset_mapping[0] = (offset_mapping[0][0], offset_mapping[1][0])
if len(tokenized_input.input_ids) > 0 and tokenized_input.input_ids[0] == 29871:
Contributor:

Does this token ID only apply to the specific version of the tokenizer we're using?

Contributor Author (liujch1998):

It applies to the Llama-2 tokenizer only. Other tokenizers may behave differently and would require more customization than just changing this token ID.

I will add a comment about this.
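One way to make the tokenizer dependence explicit (a hypothetical sketch; only the Llama-2 entry comes from this discussion) is to key the prefix token on the tokenizer name rather than hard-coding 29871 inline:

```python
# Hypothetical mapping; only the Llama-2 value (29871, the SentencePiece
# leading-whitespace token) is taken from this PR's discussion.
LEADING_WS_TOKEN = {
    "llama-2": 29871,
}

def strip_tokenizer_prefix(token_ids, tokenizer_name):
    """Drop the tokenizer-specific leading whitespace token, if any."""
    prefix = LEADING_WS_TOKEN.get(tokenizer_name)
    if prefix is not None and token_ids and token_ids[0] == prefix:
        return token_ids[1:]
    return list(token_ids)
```

Tokenizers with no entry in the mapping pass through unchanged, which makes the Llama-2-only assumption visible at the call site.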

Contributor:

Do you have a Mac version of this as well? We'll need it if any of us do local dev.

Contributor Author (liujch1998):

Yes, will add :)

@liujch1998 liujch1998 requested a review from mtblanton May 6, 2025 20:32