Zhimin-arya
Collaborator

  1. Use the CPU-only torch package.
  2. Mount the linker folder as a Prefect volume instead of downloading the linkers during the image build.

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR optimizes the data_transformation Docker image by reducing its size through two main changes: switching to a CPU-only PyTorch package and removing the download of 2GB linker files during the build process.

  • Replaces standard PyTorch with CPU-only version to reduce image size
  • Removes downloading of large linker files (~2GB) from the Docker build process
  • Provides documentation for mounting linker files via Prefect volumes instead

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| requirements_data_transformation.txt | Adds CPU-only PyTorch wheel package |
| ner_extract_plugin/README.md | Adds instructions for downloading linkers locally and mounting via Prefect |
| Dockerfile | Removes linker download commands and unused COPY instruction |
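
For reference, the CPU-only wheel added to `requirements_data_transformation.txt` is pinned by direct URL, and its filename encodes which interpreter and platform it supports. The sketch below uses only the standard library to split that filename following the wheel naming convention (PEP 427). `parse_wheel_url` is a hypothetical helper for illustration and is not part of this PR.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# URL as pinned in requirements_data_transformation.txt in this PR.
TORCH_WHEEL_URL = (
    "https://download.pytorch.org/whl/cpu/"
    "torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl"
)


def parse_wheel_url(url: str) -> dict:
    """Split a wheel URL's filename into its PEP 427 components."""
    # Take the last path segment and decode percent-escapes (%2B -> "+").
    filename = unquote(PurePosixPath(urlparse(url).path).name)
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    # Anything after "+" in the version is the local version label.
    public_version, _, local_label = parts[1].partition("+")
    return {
        "name": parts[0],
        "version": public_version,
        "local": local_label,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


if __name__ == "__main__":
    for key, value in parse_wheel_url(TORCH_WHEEL_URL).items():
        print(f"{key}: {value}")
```

For the URL pinned here, the tags come out as `cp312` and `manylinux_2_28_x86_64`, which suggests the wheel will only install on CPython 3.12 in an x86_64 Linux image. That may be worth keeping in mind if the base image changes.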

## Instruction for using ner_extract_plugin:
### 1. Download model linkers (~2GB) to user local folder
```
output_dir="path/to/linker/foler" && \

Copilot AI Sep 15, 2025

Typo in path: 'foler' should be 'folder'.

## Instruction for using ner_extract_plugin:
### 1. Download model linkers (~2GB) to user local folder
```
output_dir="path/to/linker/folder" && \
Collaborator

Should the user update "output_dir" and then run the command or just run as is?

Collaborator

What is a linker folder?

Collaborator Author

The user needs to update `output_dir` before running the command.

Collaborator Author

Linkers are needed so that each entity recognition model can use a different knowledge base.
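
To illustrate the answers above, here is a minimal sketch that checks the linker folder before it is mounted and builds the `host:container` string that step 2 of the README adds to `PREFECT_DOCKER_VOLUMES_CUSTOM`. `build_linker_volume` is a hypothetical helper, and the exact value format expected by `PREFECT_DOCKER_VOLUMES_CUSTOM` is assumed from the README example rather than confirmed.

```python
from pathlib import Path

# Container path taken from step 2 of the README under review.
CONTAINER_LINKER_DIR = "/root/.scispacy/datasets"
# Placeholder used in the README command; the user must replace it.
PLACEHOLDER = "path/to/linker/folder"


def build_linker_volume(host_dir: str, container_dir: str = CONTAINER_LINKER_DIR) -> str:
    """Return a 'host:container' volume string after basic sanity checks.

    Hypothetical helper for illustration; not part of this PR.
    """
    if host_dir.strip("/") == PLACEHOLDER:
        raise ValueError("output_dir is still the README placeholder; set a real path")
    host = Path(host_dir).expanduser().resolve()
    if not host.is_dir():
        raise FileNotFoundError(f"linker folder not found: {host}")
    # An empty folder usually means the download step was skipped.
    if not any(host.iterdir()):
        raise ValueError(f"linker folder is empty: {host}")
    return f"{host}:{container_dir}"
```

Failing early on the placeholder or on an empty folder is a design choice in this sketch: it surfaces a skipped download step before the container starts, instead of at entity-linking time.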

```

### 2. Mount the folder to Prefect Volume
In the `.env` file add `"path/to/linker/foler:/root/.scispacy/datasets"` to `PREFECT_DOCKER_VOLUMES_CUSTOM` variable
Collaborator

@csafreen csafreen Sep 15, 2025

scipy==1.16.0
# For CPU-only PyTorch
https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl
Collaborator

What is the image size now after removing model linkers?

Collaborator Author

After removing the model linkers and switching to the CPU-only torch Python package, the image size is now 8.3 GB.

@Zhimin-arya Zhimin-arya linked an issue Sep 16, 2025 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Reduce size of data transformation docker image