Support other infini-gram functionalities here #108
Conversation
  documents_by_span = await asyncio.to_thread(
-     infini_gram_index.get_documents_by_pointers,
+     infini_gram_index.get_documents_by_pointers_grouped,
I renamed this function to make way for a regular version of get_documents_by_pointers.
self.MAX_QUERY_CHARS = 1000
self.MAX_QUERY_TOKENS = 500
self.MAX_CLAUSES_PER_CNF = 4
self.MAX_TERMS_PER_CLAUSE = 4
self.MAX_SUPPORT = 10000
self.MAX_CLAUSE_FREQ = 500000
self.MAX_DIFF_TOKENS = 1000
Probably a stupid idea. Ideally these limits should be configurable without a code change.
@tracer.start_as_current_span("infini_gram_processor/get_documents_by_pointers")
def get_documents_by_pointers(
This is now the regular version of get_documents_by_pointers.
# This is to fix a corner case: when the input begins with a number, the token ids will begin with [29871 (whitespace), 29896, ...]
if len(encoded_query) > 0 and encoded_query[0] == 29871:
    encoded_query = encoded_query[1:]
I don't think this will break anything. So far the tokenizer is only used by the attribution function to tokenize the model response, and stripping the leading whitespace is actually desired there, too.
Will test OLMoTrace with this.
Update: test added.
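For reference, the corner-case fix can be isolated into a small helper. This is only a sketch (the helper name is hypothetical); token id 29871 is the whitespace token the Llama-2 tokenizer prepends when the input starts with a digit, as discussed further down:

```python
# Llama-2 tokenizer: inputs that begin with a number are encoded as
# [29871 (whitespace), 29896, ...]; dropping the sentinel keeps token
# positions aligned with the raw text.
LLAMA2_LEADING_WHITESPACE_ID = 29871

def strip_leading_whitespace_token(token_ids: list[int]) -> list[int]:
    """Drop a leading whitespace sentinel token, if present."""
    if token_ids and token_ids[0] == LLAMA2_LEADING_WHITESPACE_ID:
        return token_ids[1:]
    return token_ids
```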
pyproject.toml (outdated)
  { path = "vendor/infini_gram-2.5.1-cp312-cp312-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.12' and platform_machine == 'x86_64'" },
  { path = "vendor/infini_gram-2.5.1-cp313-cp313-macosx_10_15_x86_64.whl", marker = "sys_platform == 'darwin' and python_version == '3.13' and platform_machine == 'x86_64'" },
- { path = "vendor/infini_gram-2.5.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
+ { path = "vendor/infini_gram-2.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "sys_platform == 'linux'" },
The only change there is that I renamed get_docs_by_ptrs_2() to get_docs_by_ptrs_2_grouped() in EngineDiff
Have you published this version yet? We can stop fiddling with these if so.
I'm publishing it to pypi now. After that, what do we need to change here? I still want to pin to a specific version of infini-gram here.
We can get rid of the config that pulls from a file and reference infini-gram like the other packages in the processor's pyproject.toml!
dependencies = [
"opentelemetry-api==1.30.0",
"opentelemetry-sdk==1.30.0",
"infini-gram==2.5.1",
"transformers==4.49.0",
]
Nice!! I got rid of those file references and pinned to infini-gram==2.5.2.
@infinigram_router.post(path="/{index}/ntd")
It'd be good to explain what these endpoints do, especially for things like ntd that don't say what they are.
- @infinigram_router.post(path="/{index}/ntd")
+ @infinigram_router.post(path="/{index}/ntd", description="<Describe what this ntd endpoint does>")
https://fastapi.tiangolo.com/tutorial/path-operation-configuration/#summary-and-description
added now!
if len(offset_mapping) > 1:
    if offset_mapping[0][1] > offset_mapping[1][0]:
        offset_mapping[0] = (offset_mapping[0][0], offset_mapping[1][0])
if len(tokenized_input.input_ids) > 0 and tokenized_input.input_ids[0] == 29871:
Does this token ID only apply to the specific version of the tokenizer we're using?
It applies to the Llama-2 tokenizer only. Other tokenizers may behave differently and would need more customization than just changing this token ID.
I will add a comment about this.
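The offset-mapping part of the fix above can be expressed as a standalone helper, which might make the intent easier to test. This is a sketch under the assumption that offset_mapping is a list of (start, end) character spans, one per token (the helper name is hypothetical):

```python
def trim_overlapping_first_offset(offset_mapping):
    """If the first token's character span runs past the start of the second
    token's span (as happens when the tokenizer prepends a whitespace token
    that shares characters with the next token), clip the first span's end
    back to the second span's start so the two no longer overlap."""
    if len(offset_mapping) > 1 and offset_mapping[0][1] > offset_mapping[1][0]:
        offset_mapping = list(offset_mapping)
        offset_mapping[0] = (offset_mapping[0][0], offset_mapping[1][0])
    return offset_mapping
```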
Do you have a Mac version of this as well? We'll need it if any of us do local dev.
Yes, will add :)
Imports functionalities such as find, count, and prob, so that this API supports everything in infini-gram.
TODOs:
Testing
Calling attribution() on "1, 2, 3, 4, ..."
Before / After: (screenshots not reproduced here)
Note that the empty token is gone at the beginning. As a result, span boundaries are shifted by one.
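The resulting boundary shift can be sketched as follows (assuming spans are (start, end) token-index pairs; the helper is hypothetical, not code from the PR): once the leading whitespace token is dropped, every token-indexed span moves left by one.

```python
def shift_spans(spans, offset=1):
    """Shift every (start, end) token-index span left by `offset`,
    compensating for a dropped leading token."""
    return [(start - offset, end - offset) for start, end in spans]
```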