
Conversation


@liujch1998 liujch1998 commented Apr 30, 2025

Import functionalities such as find, count, and prob, so that this API supports everything in infini-gram.
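For context, a toy sketch of what two of these queries compute (an illustrative pure-Python model, not infini-gram's suffix-array implementation): count returns the number of occurrences of an n-gram in the corpus, and prob estimates P(last token | preceding tokens) from those counts.

```python
# Illustrative only: infini-gram computes these over a suffix array of a
# tokenized corpus; this toy version scans a small token list instead.
def count(corpus, ngram):
    """Number of occurrences of `ngram` (a tuple of tokens) in `corpus`."""
    n = len(ngram)
    return sum(1 for i in range(len(corpus) - n + 1)
               if tuple(corpus[i:i + n]) == ngram)

def prob(corpus, ngram):
    """P(ngram[-1] | ngram[:-1]) = count(ngram) / count(ngram[:-1])."""
    denom = count(corpus, ngram[:-1])
    return count(corpus, ngram) / denom if denom else 0.0

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
```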

TODOs:

  • Set max and default values of query parameters
  • Support string query: tokenize and detokenize
  • Check rate limit
  • Check with legal on document endpoints

Testing

$ python scripts/test_infini_gram_api.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.312s

OK

Calling attribution() on "1, 2, 3, 4, ..."

Before:

inputTokens: ['', '1', ',', ' ', '2', ',', ' ', '3', ',', ' ', ...]
spans:
l=0, r=86, count=1, text=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
l=32, r=90, count=1, text=11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
...

After:

inputTokens: ['1', ',', ' ', '2', ',', ' ', '3', ',', ' ', ...]
spans:
l=0, r=85, count=1, text=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
l=31, r=89, count=1, text=11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
...

Note that the empty token is gone at the beginning. As a result, span boundaries are shifted by one.
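The change can be sketched as follows (a simplified stand-in for the actual fix; 29871 is the Llama-2 SentencePiece leading-whitespace token discussed below, and the (l, r) half-open span representation is assumed):

```python
# Simplified sketch: drop Llama-2's spurious leading whitespace token
# (id 29871) and shift span boundaries left by one to match the new
# token indices. The token id is specific to the Llama-2 tokenizer.
LLAMA2_LEADING_WS = 29871

def strip_leading_ws(token_ids, spans):
    """`spans` is a list of (l, r) half-open token ranges into `token_ids`."""
    if token_ids and token_ids[0] == LLAMA2_LEADING_WS:
        token_ids = token_ids[1:]
        spans = [(max(l - 1, 0), r - 1) for (l, r) in spans]
    return token_ids, spans
```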

@liujch1998 liujch1998 requested a review from mtblanton April 30, 2025 17:27

  documents_by_span = await asyncio.to_thread(
-     infini_gram_index.get_documents_by_pointers,
+     infini_gram_index.get_documents_by_pointers_grouped,
Contributor Author (liujch1998):

I renamed this function to make way for a regular version of get_documents_by_pointers.

Comment on lines +75 to +82
self.MAX_QUERY_CHARS = 1000
self.MAX_QUERY_TOKENS = 500
self.MAX_CLAUSES_PER_CNF = 4
self.MAX_TERMS_PER_CLAUSE = 4
self.MAX_SUPPORT = 10000
self.MAX_CLAUSE_FREQ = 500000
self.MAX_DIFF_TOKENS = 1000

Contributor Author (liujch1998):

Probably a stupid idea. Ideally these should be configurable without a code change.

)

@tracer.start_as_current_span("infini_gram_processor/get_documents_by_pointers")
def get_documents_by_pointers(
Contributor Author (liujch1998):

This is now the regular version of get_documents_by_pointers.

Comment on lines 49 to 51
# This is to fix a corner case: when input begins with a number, the token ids will begin with [29871 (whitespace), 29896, ...]
if len(encoded_query) > 0 and encoded_query[0] == 29871:
encoded_query = encoded_query[1:]
Contributor Author (liujch1998), May 1, 2025:

I don't think this will break anything. So far the tokenizer is only used by the attribution function to tokenize the model response. Fixing the leading whitespace is actually desired there, too.

Will test OLMoTrace with this.

Update: test added

pyproject.toml Outdated
  { path = "vendor/infini_gram-2.5.1-cp312-cp312-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.12' and platform_machine == 'x86_64'" },
  { path = "vendor/infini_gram-2.5.1-cp313-cp313-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.13' and platform_machine == 'x86_64'" },
- { path = "vendor/infini_gram-2.5.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
+ { path = "vendor/infini_gram-2.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
Contributor Author (liujch1998):

The only change there is that I renamed get_docs_by_ptrs_2() to get_docs_by_ptrs_2_grouped() in EngineDiff

Contributor:

Have you published this version yet? We can stop fiddling with these if so.

Contributor Author (liujch1998):

I'm publishing it to pypi now. After that, what do we need to change here? I still want to pin to a specific version of infini-gram here.

Contributor:

We can get rid of the config to pull from a file and reference infini-gram like other packages in the processor's pyproject.toml!

dependencies = [
    "opentelemetry-api==1.30.0",
    "opentelemetry-sdk==1.30.0",
    "infini-gram==2.5.1",
    "transformers==4.49.0",
]

Contributor Author (liujch1998):

Nice!! I got rid of those file references and pinned to infini-gram==2.5.2.

@liujch1998 liujch1998 marked this pull request as ready for review May 1, 2025 08:18
@liujch1998 liujch1998 changed the title Import other infini-gram functionalities here Support other infini-gram functionalities here May 2, 2025
)


@infinigram_router.post(path="/{index}/ntd")
Contributor (mtblanton), May 6, 2025:

It'd be good to explain what these endpoints do, especially for things like ntd that don't say what they are.

Suggested change:
- @infinigram_router.post(path="/{index}/ntd")
+ @infinigram_router.post(path="/{index}/ntd", description="<Describe what this ntd endpoint does>")

https://fastapi.tiangolo.com/tutorial/path-operation-configuration/#summary-and-description

Contributor Author (liujch1998):

Added now!

if len(offset_mapping) > 1:
    if offset_mapping[0][1] > offset_mapping[1][0]:
        offset_mapping[0] = (offset_mapping[0][0], offset_mapping[1][0])
if len(tokenized_input.input_ids) > 0 and tokenized_input.input_ids[0] == 29871:
Contributor:

Does this token ID only apply to the specific version of the tokenizer we're using?

Contributor Author (liujch1998):

It applies to the Llama-2 tokenizer only. Other tokenizers may behave differently and would require more customization than just changing this token ID.

I will add a comment about this.
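One way to make the tokenizer dependence explicit (a hypothetical sketch; only the Llama-2 entry comes from this discussion) is to key the prefix token on the tokenizer name rather than hard-coding 29871 inline:

```python
# Hypothetical mapping; only the Llama-2 value (29871, the SentencePiece
# leading-whitespace token) is taken from this PR's discussion.
LEADING_WS_TOKEN = {
    "llama-2": 29871,
}

def strip_tokenizer_prefix(token_ids, tokenizer_name):
    """Drop the tokenizer-specific leading whitespace token, if any."""
    prefix = LEADING_WS_TOKEN.get(tokenizer_name)
    if prefix is not None and token_ids and token_ids[0] == prefix:
        return token_ids[1:]
    return list(token_ids)
```

Tokenizers with no entry in the mapping pass through unchanged, which makes the Llama-2-only assumption visible at the call site.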

Contributor:

Do you have a Mac version of this as well? We'll need it if any of us do local dev.

Contributor Author (liujch1998):

Yes, will add :)

@liujch1998 liujch1998 requested a review from mtblanton May 6, 2025 20:32