Conversation

@oOraph (Contributor) commented Apr 16, 2025

  • Allow overriding some inference parameters (related to diffusion) so that the default inference behaves well when the user does not specify them
  • Multi-task support within a single deployment (example: sentence-similarity + sentence-embeddings)
  • An api-inference compatibility env var
  • Reduced memory footprint: lazy imports of memory-greedy libs (transformers) and uvicorn replaced by gunicorn (to make the kick-and-respawn trick we use on idle workers work ~ respawn self-killed workers -> the most immediate way we found to release memory without digging into all the different lib / allocator mechanics; see the sketch after this list)
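A minimal sketch of the lazy-import part of that last point (illustrative names, not the toolkit's actual API): heavy libraries are only imported the first time a pipeline is requested, so an idle or freshly respawned worker never pays their memory cost.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def _lazy(module_name: str):
    """Import a memory-greedy module on first use and cache the module object."""
    return importlib.import_module(module_name)


def get_pipeline(task: str, model_id: str):
    # transformers is only pulled into memory when a request actually needs it
    transformers = _lazy("transformers")
    return transformers.pipeline(task=task, model=model_id)
```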

@oOraph oOraph changed the title from Dev/api inference mini fork to api inference mini fork Apr 17, 2025
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 5 times, most recently from b93b802 to 7f17bb6 Compare May 2, 2025 14:05
if default_num_steps:
    kwargs["num_inference_steps"] = int(default_num_steps)

if "guidance_scale" not in kwargs:
@oOraph (Contributor, Author) commented:
Useful for SD 3.5 Turbo: we want guidance scale 0 by default (i.e. when the user does not specify it), because the number of steps is so low that the generated images only come out ok with zero guidance.
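A hedged sketch of what that default amounts to (the env var name and the step threshold are assumptions for illustration, not the exact code in this diff):

```python
import os


def apply_diffusion_defaults(kwargs: dict) -> dict:
    """Fill in default diffusion parameters the caller did not provide."""
    default_num_steps = os.environ.get("DEFAULT_NUM_INFERENCE_STEPS")
    if default_num_steps and "num_inference_steps" not in kwargs:
        kwargs["num_inference_steps"] = int(default_num_steps)
    # Few-step checkpoints (e.g. SD 3.5 Turbo) produce degraded images with the
    # usual guidance, so default guidance_scale to 0 when the step count is very low.
    if "guidance_scale" not in kwargs and kwargs.get("num_inference_steps", 50) <= 4:
        kwargs["guidance_scale"] = 0.0
    return kwargs
```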

)


def api_inference_compat():
@oOraph (Contributor, Author) commented:
With this env var we intend to handle the small response differences between the api-inference widgets on the Hub and on the Endpoints UI. TODO: we should probably unify both widgets instead.
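A minimal sketch of what such a toggle could look like (the variable name is an assumption, not necessarily the one used here):

```python
import os


def api_inference_compat() -> bool:
    """Return True when api-inference compatible responses are enabled via env var."""
    return os.environ.get("API_INFERENCE_COMPAT", "false").lower() in ("1", "true", "yes")
```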

@oOraph oOraph requested review from alvarobartt and co42 May 5, 2025 07:57
Route("/predict", predict, methods=["POST"]),
Route("/metrics", metrics, methods=["GET"]),
]
if api_inference_compat():
@oOraph (Contributor, Author) commented May 5, 2025:
I only activated multi-task for api-inference (as a test), but we may want to remove this condition and just always support it if we're satisfied with it.

@oOraph (Contributor, Author) commented:
Actually, thinking about it, we may want a separate env var (kept deactivated by default for regular users, with an option for it exposed in Endpoints instead), because with this route the pod may consume more RAM than expected (due to the pipeline duplications).
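A hedged sketch of that idea (env var name and handler names are illustrative): gate the /pipeline/&lt;task&gt; route behind its own opt-in flag rather than the general compat one.

```python
import os
from starlette.routing import Route


def multi_task_enabled() -> bool:
    # Separate opt-in, off by default, since duplicated pipelines increase RAM usage.
    return os.environ.get("ENABLE_MULTI_TASK", "false").lower() in ("1", "true")


def build_routes(predict, metrics, pipeline_predict):
    routes = [
        Route("/predict", predict, methods=["POST"]),
        Route("/metrics", metrics, methods=["GET"]),
    ]
    if multi_task_enabled():
        # The pipeline for <task> would be loaded (and cached) on first use.
        routes.append(Route("/pipeline/{task}", pipeline_predict, methods=["POST"]))
    return routes
```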

@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 5 times, most recently from 459f3b6 to 39db7c6 Compare May 9, 2025 09:27
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from c71a4c5 to c5565c2 Compare May 13, 2025 16:16
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 3 times, most recently from 67491ac to 0818705 Compare May 23, 2025 12:53
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 2 times, most recently from b5fa0ea to bb2a6c3 Compare June 11, 2025 13:17
@oOraph oOraph requested a review from XciD June 12, 2025 08:15
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from 4673dcb to aa26602 Compare August 29, 2025 15:14
…et compat

* Env var settings:

customizable default number of inference steps
default content-type env var
default accept env var
Diffusers, txt2img (and img2img when supported): make sure guidance scale defaults to 0 when num steps <= 4 (see the sketch below)
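A hedged sketch of the header-default part (the env var names are assumptions): when a request omits Content-Type or Accept, fall back to a configurable default rather than guessing.

```python
import os

DEFAULT_CONTENT_TYPE = os.environ.get("DEFAULT_CONTENT_TYPE", "application/json")
DEFAULT_ACCEPT = os.environ.get("DEFAULT_ACCEPT", "application/json")


def resolve_headers(headers: dict) -> tuple[str, str]:
    """Pick the request content-type and accept, falling back to env-configured defaults."""
    content_type = headers.get("content-type") or DEFAULT_CONTENT_TYPE
    accept = headers.get("accept") or DEFAULT_ACCEPT
    return content_type, accept
```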

* Content-type / accept / serialization fixes:

case-insensitive content-type matching
fix: content-type and accept parsing, more flexible than an exact string match since additional parameters may be present (see the sketch below)
application/octet-stream support in content-type deserialization, no reason not to accept it
fix: avoid returning None as a serializer, return an error instead
fix: de/serializer is not optional, do not support content types we do not know how to handle
fix: explicit error message when no content-type is provided
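A minimal sketch of that tolerant parsing (illustrative helper, not the exact fix):

```python
def parse_mime(header_value: str | None) -> str | None:
    """Normalize a Content-Type/Accept value: drop parameters and ignore case."""
    if not header_value:
        return None
    # "Application/JSON; charset=UTF-8" -> "application/json"
    return header_value.split(";", 1)[0].strip().lower()
```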

* HF inference specificities

Multi-task support + /pipeline/<task> support for api-inference backward compat
api-inference compat responses
fix(api-inference): compat for text-classification and token-classification
fix: token-classification api-inference compat
fix: image segmentation on hf-inference
zero-shot classification: api-inference compat
substitute /pipeline/sentence-embeddings for /pipeline/feature-extraction for sentence-transformers
fix(api-inference): feature-extraction, flatten the array and discard the batch size dim (see the sketch below)
feat(hf-inference): disable custom handler
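A hedged sketch of that feature-extraction fix (illustrative, not the exact code): the pipeline output carries a leading batch dimension of size 1 that the api-inference response shape does not include.

```python
def strip_batch_dim(embeddings: list) -> list:
    """Drop the leading batch dimension, e.g. (1, tokens, dim) -> (tokens, dim)."""
    if isinstance(embeddings, list) and len(embeddings) == 1 and isinstance(embeddings[0], list):
        return embeddings[0]
    return embeddings
```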

* Build:
add timm and hf_xet dependencies (for object detection and xethub support)
Dockerfile refactor: split requirements and source code layers to optimize build time and improve layer reuse

* Memory footprint + kick and respawn (primary memory gc)

feat(memory): reduce memory footprint on idle service
backported and adapted from
https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/idle.py
1. use gunicorn instead of uvicorn so that wsgi/asgi workers can easily be suppressed when idle without stopping the entire service
-> an easy way to release memory without digging into the depths of the imported modules (see the sketch below)
2. lazy load of memory-consuming libs (transformers, diffusers, sentence_transformers)
3. lazy load of the pipeline as well
The first 'cold start' request tends to be a bit slower than the others, but the footprint is reduced to the minimum when idle.
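A hedged sketch of the kick-and-respawn part (names and timings are illustrative, adapted conceptually from the idle.py linked above): a worker that has been idle for too long terminates itself, and the gunicorn master respawns a fresh one that starts with the minimal footprint.

```python
import asyncio
import os
import signal
import time

IDLE_TIMEOUT = int(os.environ.get("IDLE_TIMEOUT", "300"))
_last_activity = time.monotonic()


def mark_activity() -> None:
    """Called at request time to refresh the idle clock."""
    global _last_activity
    _last_activity = time.monotonic()


async def idle_watchdog() -> None:
    """Background task: let gunicorn respawn us once we have been idle long enough."""
    while True:
        await asyncio.sleep(30)
        if time.monotonic() - _last_activity > IDLE_TIMEOUT:
            os.kill(os.getpid(), signal.SIGTERM)
```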
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 2 times, most recently from 935e4f4 to c0a0e42 Compare September 19, 2025 07:56
…nswer anymore*

When behind a proxy, though, this requires the proxy to close the connection for it to be effective.

Signed-off-by: Raphael Glon <[email protected]>
* env var for the log level
* some long blocking sync calls should be wrapped in a thread (model download); see the sketch below
* the idle check should cover the entire predict call, otherwise in non-idle mode the worker could be kicked in the middle of a request

Signed-off-by: Raphael Glon <[email protected]>
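A hedged sketch of the thread-offloading point (assuming huggingface_hub's snapshot_download as the blocking download; the surrounding toolkit code is not shown):

```python
import anyio
from huggingface_hub import snapshot_download


async def fetch_model(model_id: str) -> str:
    """Run the blocking model download in a worker thread so the event loop stays free."""
    # snapshot_download can block for minutes on large models; offloading it keeps
    # health checks and other requests responsive while the weights are pulled.
    return await anyio.to_thread.run_sync(snapshot_download, model_id)
```

The idle-check point then amounts to refreshing the idle timestamp when the predict call finishes rather than when it starts, so the watchdog from the earlier sketch never counts in-flight time as idle time.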
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from 1a22ea5 to 54d2596 Compare November 13, 2025 13:37