Conversation

@oOraph (Contributor) commented Apr 16, 2025

  • Allow overriding some inference parameters (related to diffusion) so that the default inference behaves well when the user does not specify them
  • Multi-task support within a single deployment (example: sentence-similarity + sentence-embeddings)
  • An api-inference compatibility env var
  • Reduced memory footprint: lazy imports of memory-greedy libs (transformers) and uvicorn replaced by gunicorn (to make the kick-and-respawn trick we use on idle workers work ~ respawn self-killed workers -> the most immediate way we found to release memory without digging into all the different lib / allocator mechanics; see the sketch after this list)
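A minimal sketch of the lazy-import part of that last point (illustrative names, not the toolkit's actual API): heavy libraries are only imported the first time a pipeline is requested, so an idle or freshly respawned worker never pays their memory cost.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def _lazy(module_name: str):
    """Import a memory-greedy module on first use and cache the module object."""
    return importlib.import_module(module_name)


def get_pipeline(task: str, model_id: str):
    # transformers is only pulled into memory when a request actually needs it
    transformers = _lazy("transformers")
    return transformers.pipeline(task=task, model=model_id)
```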

@oOraph oOraph changed the title from Dev/api inference mini fork to api inference mini fork Apr 17, 2025
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 5 times, most recently from b93b802 to 7f17bb6 Compare May 2, 2025 14:05
if default_num_steps:
    kwargs["num_inference_steps"] = int(default_num_steps)

if "guidance_scale" not in kwargs:
@oOraph (Contributor, Author) commented:
Useful for SD 3.5 Turbo: we want guidance scale 0 by default (i.e. when the user does not specify it), because the number of steps is so low that the generated images only come out ok with zero guidance.
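A hedged sketch of what that default amounts to (the env var name and the step threshold are assumptions for illustration, not the exact code in this diff):

```python
import os


def apply_diffusion_defaults(kwargs: dict) -> dict:
    """Fill in default diffusion parameters the caller did not provide."""
    default_num_steps = os.environ.get("DEFAULT_NUM_INFERENCE_STEPS")
    if default_num_steps and "num_inference_steps" not in kwargs:
        kwargs["num_inference_steps"] = int(default_num_steps)
    # Few-step checkpoints (e.g. SD 3.5 Turbo) produce degraded images with the
    # usual guidance, so default guidance_scale to 0 when the step count is very low.
    if "guidance_scale" not in kwargs and kwargs.get("num_inference_steps", 50) <= 4:
        kwargs["guidance_scale"] = 0.0
    return kwargs
```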

)


def api_inference_compat():
@oOraph (Contributor, Author) commented:
With this env var we intend to handle the small response differences between the api-inference widgets on the Hub and on the Endpoints UI. TODO: we should probably unify both widgets instead.
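A minimal sketch of what such a toggle could look like (the variable name is an assumption, not necessarily the one used here):

```python
import os


def api_inference_compat() -> bool:
    """Return True when api-inference compatible responses are enabled via env var."""
    return os.environ.get("API_INFERENCE_COMPAT", "false").lower() in ("1", "true", "yes")
```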

@oOraph oOraph requested review from alvarobartt and co42 May 5, 2025 07:57
Route("/predict", predict, methods=["POST"]),
Route("/metrics", metrics, methods=["GET"]),
]
if api_inference_compat():
@oOraph (Contributor, Author) commented May 5, 2025:
I only activated multi-task for api-inference (as a test), but we may want to remove this condition and just always support it if we're satisfied with it.

@oOraph (Contributor, Author) commented:
Actually, thinking about it, we may want a separate env var (kept deactivated by default for regular users, with an option for it exposed in Endpoints instead), because with this route the pod may consume more RAM than expected (due to the pipeline duplications).
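A hedged sketch of that idea (env var name and handler names are illustrative): gate the /pipeline/&lt;task&gt; route behind its own opt-in flag rather than the general compat one.

```python
import os
from starlette.routing import Route


def multi_task_enabled() -> bool:
    # Separate opt-in, off by default, since duplicated pipelines increase RAM usage.
    return os.environ.get("ENABLE_MULTI_TASK", "false").lower() in ("1", "true")


def build_routes(predict, metrics, pipeline_predict):
    routes = [
        Route("/predict", predict, methods=["POST"]),
        Route("/metrics", metrics, methods=["GET"]),
    ]
    if multi_task_enabled():
        # The pipeline for <task> would be loaded (and cached) on first use.
        routes.append(Route("/pipeline/{task}", pipeline_predict, methods=["POST"]))
    return routes
```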

@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 5 times, most recently from 459f3b6 to 39db7c6 Compare May 9, 2025 09:27
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from c71a4c5 to c5565c2 Compare May 13, 2025 16:16
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 3 times, most recently from 67491ac to 0818705 Compare May 23, 2025 12:53
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 2 times, most recently from b5fa0ea to bb2a6c3 Compare June 11, 2025 13:17
@oOraph oOraph requested a review from XciD June 12, 2025 08:15
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from 4673dcb to aa26602 Compare August 29, 2025 15:14
…et compat

* Env var settings:

customizable default number of inference steps
default content-type env var
default accept env var
Diffusers, txt2img (and img2img when supported): make sure guidance scale defaults to 0 when num steps <= 4 (see the sketch below)
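A hedged sketch of the header-default part (the env var names are assumptions): when a request omits Content-Type or Accept, fall back to a configurable default rather than guessing.

```python
import os

DEFAULT_CONTENT_TYPE = os.environ.get("DEFAULT_CONTENT_TYPE", "application/json")
DEFAULT_ACCEPT = os.environ.get("DEFAULT_ACCEPT", "application/json")


def resolve_headers(headers: dict) -> tuple[str, str]:
    """Pick the request content-type and accept, falling back to env-configured defaults."""
    content_type = headers.get("content-type") or DEFAULT_CONTENT_TYPE
    accept = headers.get("accept") or DEFAULT_ACCEPT
    return content_type, accept
```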

* Content-type / accept / serialization fixes:

case-insensitive content-type matching
fix: content-type and accept parsing, more flexible than an exact string match since additional parameters may be present (see the sketch below)
application/octet-stream support in content-type deserialization, no reason not to accept it
fix: avoid returning None as a serializer, return an error instead
fix: de/serializer is not optional, do not support content types we do not know how to handle
fix: explicit error message when no content-type is provided
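A minimal sketch of that tolerant parsing (illustrative helper, not the exact fix):

```python
def parse_mime(header_value: str | None) -> str | None:
    """Normalize a Content-Type/Accept value: drop parameters and ignore case."""
    if not header_value:
        return None
    # "Application/JSON; charset=UTF-8" -> "application/json"
    return header_value.split(";", 1)[0].strip().lower()
```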

* HF inference specificities

Multi-task support + /pipeline/<task> support for api-inference backward compat
api-inference compat responses
fix(api-inference): compat for text-classification and token-classification
fix: token-classification api-inference compat
fix: image segmentation on hf-inference
zero-shot classification: api-inference compat
substitute /pipeline/sentence-embeddings for /pipeline/feature-extraction for sentence-transformers
fix(api-inference): feature-extraction, flatten the array and discard the batch size dim (see the sketch below)
feat(hf-inference): disable custom handler
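A hedged sketch of that feature-extraction fix (illustrative, not the exact code): the pipeline output carries a leading batch dimension of size 1 that the api-inference response shape does not include.

```python
def strip_batch_dim(embeddings: list) -> list:
    """Drop the leading batch dimension, e.g. (1, tokens, dim) -> (tokens, dim)."""
    if isinstance(embeddings, list) and len(embeddings) == 1 and isinstance(embeddings[0], list):
        return embeddings[0]
    return embeddings
```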

* Build:
add timm and hf_xet dependencies (for object detection and xethub support)
Dockerfile refactor: split requirements and source code layers to optimize build time and improve layer reuse

* Memory footprint + kick and respawn (primary memory gc)

feat(memory): reduce memory footprint on idle service
backported and adapted from
https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/idle.py
1. use gunicorn instead of uvicorn so that wsgi/asgi workers can easily be suppressed when idle without stopping the entire service
-> an easy way to release memory without digging into the depths of the imported modules (see the sketch below)
2. lazy load of memory-consuming libs (transformers, diffusers, sentence_transformers)
3. lazy load of the pipeline as well
The first 'cold start' request tends to be a bit slower than the others, but the footprint is reduced to the minimum when idle.
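A hedged sketch of the kick-and-respawn part (names and timings are illustrative, adapted conceptually from the idle.py linked above): a worker that has been idle for too long terminates itself, and the gunicorn master respawns a fresh one that starts with the minimal footprint.

```python
import asyncio
import os
import signal
import time

IDLE_TIMEOUT = int(os.environ.get("IDLE_TIMEOUT", "300"))
_last_activity = time.monotonic()


def mark_activity() -> None:
    """Called at request time to refresh the idle clock."""
    global _last_activity
    _last_activity = time.monotonic()


async def idle_watchdog() -> None:
    """Background task: let gunicorn respawn us once we have been idle long enough."""
    while True:
        await asyncio.sleep(30)
        if time.monotonic() - _last_activity > IDLE_TIMEOUT:
            os.kill(os.getpid(), signal.SIGTERM)
```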
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch 2 times, most recently from 935e4f4 to c0a0e42 Compare September 19, 2025 07:56
…nswer anymore*

When behind a proxy, though, this requires the proxy to close the connection for it to be effective.

Signed-off-by: Raphael Glon <[email protected]>
* env var for the log level
* some long blocking sync calls should be wrapped in a thread (model download); see the sketch below
* the idle check should cover the entire predict call, otherwise in non-idle mode the worker could be kicked in the middle of a request

Signed-off-by: Raphael Glon <[email protected]>
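A hedged sketch of the thread-offloading point (assuming huggingface_hub's snapshot_download as the blocking download; the surrounding toolkit code is not shown):

```python
import anyio
from huggingface_hub import snapshot_download


async def fetch_model(model_id: str) -> str:
    """Run the blocking model download in a worker thread so the event loop stays free."""
    # snapshot_download can block for minutes on large models; offloading it keeps
    # health checks and other requests responsive while the weights are pulled.
    return await anyio.to_thread.run_sync(snapshot_download, model_id)
```

The idle-check point then amounts to refreshing the idle timestamp when the predict call finishes rather than when it starts, so the watchdog from the earlier sketch never counts in-flight time as idle time.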
@oOraph oOraph force-pushed the dev/api-inference-mini-fork branch from 1a22ea5 to 54d2596 Compare November 13, 2025 13:37