feat(hf-inference): fork for hf-inference optim (overcommit) and widget compat
* Env var settings (a sketch follows this list):
env var to customize the default number of inference steps
default content-type env var
default accept env var
Diffusers, txt2img (and img2img when supported): make sure guidance scale defaults to 0 when num steps <= 4
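A minimal sketch of how these defaults might be wired, assuming hypothetical env var names (DEFAULT_NUM_INFERENCE_STEPS, DEFAULT_CONTENT_TYPE, DEFAULT_ACCEPT) and a helper resolve_diffusers_kwargs that are not the fork's actual identifiers:

```python
import os

# Hypothetical env var names for illustration; the fork's actual settings may differ.
DEFAULT_NUM_INFERENCE_STEPS = int(os.getenv("DEFAULT_NUM_INFERENCE_STEPS", "25"))
DEFAULT_CONTENT_TYPE = os.getenv("DEFAULT_CONTENT_TYPE", "application/json")
DEFAULT_ACCEPT = os.getenv("DEFAULT_ACCEPT", "application/json")

def resolve_diffusers_kwargs(params: dict) -> dict:
    """Fill in defaults for a txt2img (or img2img) call."""
    num_steps = params.get("num_inference_steps", DEFAULT_NUM_INFERENCE_STEPS)
    kwargs = {"num_inference_steps": num_steps}
    if "guidance_scale" in params:
        kwargs["guidance_scale"] = params["guidance_scale"]
    elif num_steps <= 4:
        # Few-step (turbo/LCM style) generation: default guidance scale to 0
        # when the caller did not set one explicitly.
        kwargs["guidance_scale"] = 0.0
    return kwargs
```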
* Content-type / accept / serialization fixes (sketched below):
content-type matching is case-insensitive
fix: content-type and accept parsing, more flexible than an exact string match since additional parameters may be present
application/octet-stream support in content-type deserialization, no reason not to accept it
fix: avoid returning None as a serializer, return an error instead
fix: de/serializer is not optional, do not support content types we do not know how to handle
fix: explicit error message when no content-type is provided
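A rough sketch of the parsing behavior listed above, assuming a hypothetical deserializer registry; the toolkit's actual registry and error types may differ:

```python
# Hypothetical registry for illustration only.
KNOWN_DESERIALIZERS = {
    "application/json": "json_deserializer",
    "application/octet-stream": "raw_bytes_deserializer",
}

def normalize_mime(header_value: str) -> str:
    # Case-insensitive match, ignoring extra parameters such as "; charset=utf-8".
    return header_value.split(";")[0].strip().lower()

def get_deserializer(content_type):
    if not content_type:
        # Explicit error instead of a silent default when no content-type is sent.
        raise ValueError("A content-type header must be provided")
    deserializer = KNOWN_DESERIALIZERS.get(normalize_mime(content_type))
    if deserializer is None:
        # Never return None: unknown content types are rejected with an error.
        raise ValueError(f"Unsupported content-type: {content_type}")
    return deserializer
```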
* HF inference specificities (a sketch follows this list)
Multi-task support + /pipeline/<task> routes for api-inference backward compat
api-inference compat responses
fix(api-inference): compat for text-classification and token-classification
fix: token-classification api-inference compat
fix: image segmentation on hf-inference
zero-shot classification: api-inference compat
substitute /pipeline/sentence-embeddings for /pipeline/feature-extraction for sentence transformers
fix(api-inference): feature-extraction, flatten the array, discarding the batch-size dim
feat(hf-inference): disable custom handler
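An illustrative sketch, with hypothetical helper names, of two of the compat behaviors above: the /pipeline/<task> aliasing for sentence transformers and the batch-dim flattening for feature-extraction; it is not the fork's actual routing code:

```python
# Hypothetical alias table: which /pipeline/<task> paths are remapped.
PIPELINE_ROUTE_ALIASES = {
    # sentence-transformers models serve /pipeline/feature-extraction requests
    # with the sentence-embeddings task for api-inference compatibility.
    "feature-extraction": "sentence-embeddings",
}

def resolve_task(path_task, default_task, is_sentence_transformers):
    """Map a /pipeline/<task> path segment to the task actually executed."""
    task = path_task or default_task
    if is_sentence_transformers:
        return PIPELINE_ROUTE_ALIASES.get(task, task)
    return task

def flatten_feature_extraction(embeddings):
    """Drop the batch-size dimension so the widget receives a single
    (seq_len x hidden) array instead of a one-element batch."""
    if isinstance(embeddings, list) and len(embeddings) == 1:
        return embeddings[0]
    return embeddings
```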
* Build (a Dockerfile sketch follows this list):
add timm and hf_xet dependencies (for object detection and xethub support)
Dockerfile refactor: split requirements and source code into separate layers to optimize build time and improve layer reuse
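A minimal Dockerfile sketch of the layer split described above; the base image, paths, requirements file, and entrypoint are assumptions, not the fork's actual Dockerfile:

```dockerfile
# Assumed base image and paths, for illustration only.
FROM python:3.11-slim

WORKDIR /app

# Requirements layer first: only rebuilt when dependencies change,
# so it stays cached across source-only changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code layer last: changes here do not invalidate the deps layer.
COPY src/ ./src/

# Hypothetical entrypoint module; gunicorn with an asgi worker per the memory section below.
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "src.app:app"]
```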
* Memory footprint + kick and respawn (primary memory gc)
feat(memory): reduce memory footprint on idle service
backported and adapted from
https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/idle.py
1. add gunicorn instead of uvicorn so wsgi/asgi workers can easily be suppressed when idle without stopping the entire service
-> an easy way to release memory without digging into the depths of the imported modules
2. lazy-load memory-consuming libs (transformers, diffusers, sentence_transformers)
3. lazy-load the pipeline as well
The first 'cold start' request tends to be a bit slower than subsequent ones, but the footprint is reduced to a minimum when idle (sketched below).
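A rough sketch of the lazy-load plus idle-exit pattern described above, loosely adapted from the linked idle.py; module paths, the IDLE_TIMEOUT env var, and the timings are illustrative assumptions, not the fork's actual code:

```python
import asyncio
import os
import signal
import time

# Hypothetical env var controlling how long the worker may stay idle.
IDLE_TIMEOUT = int(os.getenv("IDLE_TIMEOUT", "300"))  # seconds

_last_request = time.monotonic()
_pipeline = None

def get_pipeline(task, model_id):
    """Lazy-load heavy libraries and the pipeline on first use (cold start)."""
    global _pipeline
    if _pipeline is None:
        import transformers  # imported only when actually needed
        _pipeline = transformers.pipeline(task, model=model_id)
    return _pipeline

def touch():
    """Record activity; call this at the start of every request."""
    global _last_request
    _last_request = time.monotonic()

async def idle_watchdog():
    """When idle for too long, exit the worker; gunicorn respawns a fresh one,
    returning the memory held by the loaded model to the OS."""
    while True:
        await asyncio.sleep(15)
        if time.monotonic() - _last_request > IDLE_TIMEOUT:
            os.kill(os.getpid(), signal.SIGTERM)
```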