Description
System Info
The model runs on CPU+RAM. Ryzen 6800H, 14 GB DDR5. Host system: Fedora 41; Docker base image: python:3.13.7.
The FastAPI microservice is deployed in Docker with these requirements:
fastapi==0.117.1
huggingface-hub==0.35.1
numpy==2.3.3
pillow==11.3.0
pydantic==2.11.9
python-multipart==0.0.20
requests==2.32.4
sentencepiece==0.2.1
torch==2.8.0
torchvision==0.23.0
transformers==4.56.2
uvicorn==0.37.0
transformers env:
- transformers version: 4.56.2
- Platform: Linux-6.16.7-100.fc41.x86_64-x86_64-with-glibc2.41
- Python version: 3.13.7
- Huggingface_hub version: 0.35.1
- Safetensors version: 0.6.2
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: (I'm not sure tbh...)
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Use CPU+RAM
- UPD: Add at least 1000 images to the `data` folder
- Git clone this repository: https://github.com/Adefey/search_dir, checkout to this commit: d8a40baa434c4b00ad04dada9d7221edb111f4aa
- Run: docker compose up --build
- Call the API: POST http://localhost:8003/api/v1/start_discovery (see the Python sketch after this list)
- Monitor the logs. Depending on available system memory, embedding-service may hit OOM (`embedding-service exited with code 137`) after some batches (on my system with 14 GB RAM it takes 10 batches)
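For reference, the same call can be issued from Python using the `requests` package that is already in the service requirements (a sketch; the exact request body, if any, is defined in the linked repository, and here I assume none is needed):

```python
# Hypothetical Python equivalent of the API call from the reproduction steps.
# Assumes the endpoint needs no request body; adjust if the repository defines one.
import requests

response = requests.post("http://localhost:8003/api/v1/start_discovery", timeout=30)
print(response.status_code, response.text)
```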
But actually there are only 3 methods (`__init__`, `_encode`, `encode_images`) that may be relevant. Repeated calls of `encode_images` result in OOM. I'll show the relevant methods here:
```python
def __init__(self):
    self.model_checkpoint = "openai/clip-vit-base-patch32"
    os.system("transformers env")
    self.device = "cuda" if torch.cuda.is_available() else "cpu"
    logger.info(f"Start setting up model {self.model_checkpoint} on {self.device}")
    self.model = AutoModel.from_pretrained(self.model_checkpoint).to(self.device)
    self.processor = AutoProcessor.from_pretrained(self.model_checkpoint, use_fast=False)
    logger.info(f"Finished setting up model {self.model_checkpoint} on {self.device}")

def _encode(self, inputs: dict) -> list[float]:
    inputs = {k: v.to(self.device) for k, v in inputs.items()}
    if "pixel_values" in inputs:
        features = self.model.get_image_features(**inputs)
    else:
        features = self.model.get_text_features(**inputs)
    result = features.cpu().detach().numpy().tolist()
    del inputs
    del features
    if self.device == "cuda":
        torch.cuda.empty_cache()
    # ??????????
    # trim_memory()
    return result

def encode_images(self, images: list[bytes]) -> list[list[float]]:
    """
    Process images into embeddings
    """
    logger.info(f"Start encoding images")
    image_list = [Image.open(io.BytesIO(image)) for image in images]
    with torch.inference_mode():
        inputs = self.processor(
            images=image_list,
            return_tensors="pt",
            padding=True,
        )
        result = self._encode(inputs)
    for image in image_list:
        image.close()
    logger.info(f"Finished encoding images")
    return result
```
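To take the service out of the picture, the growth can also be observed with a standalone loop that re-encodes the same batch and prints the process's peak RSS after every iteration (a minimal sketch, not taken from the repository; image size and batch size are arbitrary):

```python
# Standalone sketch: repeatedly encode one synthetic batch of 30 images on CPU
# and print peak resident memory after each iteration (on Linux, ru_maxrss is in KB).
import resource

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint, use_fast=False)

# One synthetic "batch" of 30 RGB images, mimicking the service payload.
batch = [Image.new("RGB", (640, 480), color=(i, i, i)) for i in range(30)]

for step in range(20):
    with torch.inference_mode():
        inputs = processor(images=batch, return_tensors="pt", padding=True)
        features = model.get_image_features(**inputs)
        _ = features.cpu().detach().numpy().tolist()
    rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"iteration {step:02d}: peak RSS {rss_mb:.0f} MB")
```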
Also, there is a working fix: calling `trim_memory()` after each model call:

```python
import ctypes


def trim_memory():
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)
```
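In the working variant, `_encode` ends with the workaround call, i.e. the commented-out line in the snippet above is enabled (a sketch mirroring the method shown earlier):

```python
def _encode(self, inputs: dict) -> list[float]:
    inputs = {k: v.to(self.device) for k, v in inputs.items()}
    if "pixel_values" in inputs:
        features = self.model.get_image_features(**inputs)
    else:
        features = self.model.get_text_features(**inputs)
    result = features.cpu().detach().numpy().tolist()
    del inputs
    del features
    if self.device == "cuda":
        torch.cuda.empty_cache()
    # The workaround: ask glibc to return freed heap pages to the OS after every call.
    trim_memory()
    return result
```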
But I think this is a workaround, and the transformers library should manage resources correctly on its own.
Expected behavior
Use case: the microservice gets batches of 30 images to calculate embeddings, then another batch, and so on. After 10 batches the service is killed because of OOM. Manually monitoring memory in htop shows usage increasing by 600-800 MB with every batch. Expected behavior: roughly constant memory usage throughout batch processing.
I suspect a memory leak or memory fragmentation issue where new memory keeps being allocated instead of being reused.
UPD: exactly the same OOM issue happens with google/siglip2-base-patch16-256, and again the malloc_trim workaround works.