api inference mini fork #109
base: main
Conversation
Force-pushed from b93b802 to 7f17bb6
if default_num_steps:
    kwargs["num_inference_steps"] = int(default_num_steps)

if "guidance_scale" not in kwargs:
Useful for SD 3.5 Turbo: we want guidance_scale to default to 0 (i.e. when not specified by the user) because the number of inference steps is very low, so that the generated images still come out okay.
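For reference, a minimal sketch of what this default could look like, assuming kwargs holds the user-supplied pipeline arguments; the helper name is hypothetical and the <= 4 threshold is taken from the commit message below:

# Sketch only; the actual code in the PR may differ.
def apply_guidance_scale_default(kwargs: dict) -> dict:
    num_steps = kwargs.get("num_inference_steps")
    if "guidance_scale" not in kwargs and num_steps is not None and int(num_steps) <= 4:
        # Few-step distilled models such as SD 3.5 Turbo are meant to run
        # without classifier-free guidance, so default the scale to 0.
        kwargs["guidance_scale"] = 0.0
    return kwargs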
)


def api_inference_compat():
With this env var we intend to handle the small response differences between the api-inference widgets on the Hub and on the Endpoints UI. TODO: we should probably unify both widgets instead.
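A minimal sketch of how such a flag helper could look; the env var name here is an assumption, not necessarily the one used in this PR:

import os

def api_inference_compat() -> bool:
    # Hypothetical variable name; the actual env var may differ.
    return os.getenv("API_INFERENCE_COMPAT", "0").lower() in ("1", "true", "yes")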
| Route("/predict", predict, methods=["POST"]), | ||
| Route("/metrics", metrics, methods=["GET"]), | ||
| ] | ||
| if api_inference_compat(): |
I only activated multi-task support for api-inference (as a test), but we may want to remove this condition and just always support it if we're satisfied with it.
Actually, thinking about it, we may want a separate env var (kept deactivated by default for regular users, with an option for it in Endpoints instead), because the pod may consume more RAM than expected with this route (due to the pipeline duplications).
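A rough sketch of gating the multi-task route behind its own opt-in flag, following the Starlette routing shown in the diff above; MULTI_TASK, pipeline_route and the placeholder handlers are assumptions, not the PR's actual names:

import os

from starlette.applications import Starlette
from starlette.responses import JSONResponse, PlainTextResponse
from starlette.routing import Route

def multi_task_enabled() -> bool:
    # Hypothetical, separate opt-in flag, deactivated by default so regular
    # users keep the lighter single-pipeline memory footprint.
    return os.getenv("MULTI_TASK", "0").lower() in ("1", "true", "yes")

async def predict(request):
    # Placeholder handler standing in for the service's real predict endpoint.
    return JSONResponse({"ok": True})

async def metrics(request):
    # Placeholder handler standing in for the metrics endpoint.
    return PlainTextResponse("")

async def pipeline_route(request):
    # Placeholder per-task handler for /pipeline/<task>.
    return JSONResponse({"task": request.path_params["task"]})

routes = [
    Route("/predict", predict, methods=["POST"]),
    Route("/metrics", metrics, methods=["GET"]),
]
if multi_task_enabled():
    # Extra pipelines are only instantiated when explicitly requested,
    # keeping the default RAM footprint unchanged.
    routes.append(Route("/pipeline/{task}", pipeline_route, methods=["POST"]))

app = Starlette(routes=routes)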
Force-pushed from 459f3b6 to 39db7c6
Force-pushed from c71a4c5 to c5565c2
Force-pushed from 67491ac to 0818705
Force-pushed from b5fa0ea to bb2a6c3
Force-pushed from 4673dcb to aa26602
…et compat

* Env var settings:
  - customize default num inference steps
  - default content type env var
  - default accept env var
  - Diffusers, txt2img (and img2img when supported): make sure guidance scale defaults to 0 when num steps <= 4

* Content-type / accept / serialization fixes (sketched below):
  - content type case-insensitive matching
  - fix: content-type and accept parsing, more flexibility than an exact string match since there can be some additional params
  - application/octet-stream support in content type deserialization, no reason not to accept it
  - fix: avoid returning None as a serializer, return an error instead
  - fix: de/serializer is not optional, do not support a content type we do not know what to do with
  - fix: explicit error message when no content-type is provided

* HF inference specificities:
  - multi task support + /pipeline/<task> support for api-inference backward compat
  - api inference compat responses
  - fix(api inference): compat for text-classification, token-classification
  - fix: token classification api-inference-compat
  - fix: image segmentation on hf inference
  - zero shot classif: api inference compat
  - substitute /pipeline/sentence-embeddings for /pipeline/feature-extraction for sentence transformers
  - fix(api-inference): feature-extraction, flatten array, discard the batch size dim
  - feat(hf-inference): disable custom handler

* Build:
  - add timm and hf_xet dependencies (for object detection, xethub support)
  - Dockerfile refactor: split requirements and source code layers, to optimize build time and enhance layer reuse

* Memory footprint + kick and respawn (primary memory gc)
  feat(memory): reduce memory footprint on idle service, backported and adapted from https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/idle.py
  1. gunicorn instead of uvicorn, so wsgi/asgi workers can easily be suppressed when idle without stopping the entire service -> an easy way to release memory without digging into the depths of the imported modules
  2. lazy load of memory-consuming libs (transformers, diffusers, sentence_transformers)
  3. pipeline lazy load as well
  The first 'cold start' request tends to be a bit slower than the others, but the footprint is reduced to the minimum when idle.
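As a concrete illustration of the more lenient content-type / accept matching mentioned above, a small sketch (the helper name is assumed; the real parsing in the PR may differ):

def normalize_media_type(header_value: str) -> str:
    # Keep only the media type, dropping parameters such as "; charset=utf-8",
    # and lowercase it so "Application/JSON; charset=utf-8" matches "application/json".
    return header_value.split(";", 1)[0].strip().lower()

assert normalize_media_type("Application/JSON; charset=utf-8") == "application/json"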
Force-pushed from 935e4f4 to c0a0e42
…nswer anymore
* When behind a proxy, this requires the proxy to close the connection to be effective, though.
Signed-off-by: Raphael Glon <[email protected]>
* environment log level var
* some long blocking sync calls should be wrapped in a thread (model download), see the sketch below
* the idle check should cover the entire predict call, otherwise in non-idle mode the worker could be kicked in the middle of a request
Signed-off-by: Raphael Glon <[email protected]>
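A small sketch of wrapping a blocking call in a worker thread from an async handler, using Starlette's run_in_threadpool; download_model is a placeholder for the actual blocking download:

from starlette.concurrency import run_in_threadpool
from starlette.responses import JSONResponse

def download_model(model_id: str) -> str:
    # Placeholder for the blocking download (e.g. a snapshot download).
    return f"/data/models/{model_id}"

async def predict(request):
    payload = await request.json()
    # Run the blocking download in a thread so the event loop (health checks,
    # metrics) stays responsive during long downloads.
    model_path = await run_in_threadpool(download_model, payload["model_id"])
    return JSONResponse({"model_path": model_path})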
Force-pushed from 1a22ea5 to 54d2596