On error, the framework reloads the model even if it was already loaded #47

@mattany

In _run_single_test we have the following code:

        model, tokenizer = load_model(
            self.model_path,
            num_gpus=self.num_gpus,
            device=self.device,
            debug=self.debug,
        )

This happens in a loop in generation_results:

        for attempt in range(max_retries):
            try:
                state = self._run_single_test()
                if state:
                    print(f"Test function successful on attempt {attempt + 1}")
                    return state
            except Exception as e:
                
                print(f"Test function failed on attempt {attempt + 1}")
                import traceback; traceback.print_exc();
                print(f"Retrying in {retry_interval} seconds...")
                time.sleep(retry_interval)

So if the model was already loaded and an error then occurs, the next retry loads the model again without freeing the memory held by the first copy, which often causes OOM errors. The model should be stored in a class property so that it can be reused if it has already been loaded onto the GPU.
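
A minimal sketch of what this could look like, not the framework's actual code. The names Runner, _get_model and _attempts are hypothetical, and load_model is replaced by a stub that only counts calls, so the example runs without a GPU. The retry loop mirrors the one quoted above; the only change is that the model is loaded lazily and kept on the instance.

```python
import time


def load_model(model_path, num_gpus=1, device="cuda", debug=False):
    """Stub standing in for the framework's load_model (assumption).

    It counts how often it is called instead of loading anything.
    """
    load_model.calls += 1
    return object(), object()  # placeholder (model, tokenizer) pair


load_model.calls = 0


class Runner:
    """Hypothetical test runner that caches the loaded model."""

    def __init__(self, model_path, num_gpus=1, device="cuda", debug=False):
        self.model_path = model_path
        self.num_gpus = num_gpus
        self.device = device
        self.debug = debug
        self._model = None
        self._tokenizer = None
        self._attempts = 0  # only used to simulate failures below

    def _get_model(self):
        # Load once; later calls reuse the cached objects.
        if self._model is None:
            self._model, self._tokenizer = load_model(
                self.model_path,
                num_gpus=self.num_gpus,
                device=self.device,
                debug=self.debug,
            )
        return self._model, self._tokenizer

    def _run_single_test(self):
        model, tokenizer = self._get_model()
        self._attempts += 1
        if self._attempts < 3:
            # Simulate an error that happens after the model is loaded.
            raise RuntimeError("simulated failure after load")
        return {"ok": True}

    def generation_results(self, max_retries=3, retry_interval=0):
        for attempt in range(max_retries):
            try:
                state = self._run_single_test()
                if state:
                    return state
            except Exception:
                time.sleep(retry_interval)
        return None
```

With this structure, two failed attempts followed by a successful one call load_model only once, because the retries go through _get_model rather than loading directly.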
